What I learned today
Today I had a client emergency. Their website response times had been deteriorating lately and hit a critical point today, causing cascading failures in their systems. When I logged into the database server and started watching the processes, it became clear fairly quickly that the server was either overloaded or spinning its gears, and that this was slowing the website.
I looked at the usual culprits for system slowdown (disk space, runaway processes, excessive memory consumption, swap space issues), with no success in finding the problem. I started watching the database processes for bad (unoptimized) queries and found several locked, waiting for another query to complete.
A little digging later and I found a sessions query was locking the system. Puzzled, because this query is optimized and very simple, I went to clear old sessions from the sessions table.
That's when I realized there were around 670k sessions in the table.
There should be a few thousand, since they're cleared out on a regular basis, so seeing this many was momentarily puzzling. I started a delete query to remove all of the anonymous user sessions.
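For what it's worth, here is a minimal sketch of how a cleanup like this could be run in small batches, so that no single statement holds its locks for long. This is not what I ran (I ran one big delete), and the schema is an assumption for illustration: a hypothetical sessions table where uid = 0 marks an anonymous session, with SQLite standing in for the real database server.

```python
import sqlite3


def delete_anonymous_sessions(conn, batch_size=1000):
    """Delete anonymous sessions in small batches; return total rows removed.

    Assumes a hypothetical `sessions` table where uid = 0 means anonymous.
    """
    total = 0
    while True:
        # Remove at most batch_size rows per statement so each
        # transaction stays short and other queries can get in between.
        cur = conn.execute(
            "DELETE FROM sessions WHERE rowid IN ("
            "SELECT rowid FROM sessions WHERE uid = 0 LIMIT ?)",
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total
```

The batch size is a guess; the right number depends on how long the server can tolerate a lock on that table.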
I have so many watches, checks, and alerts on these systems that I receive warnings (email, SMS, IM) whenever one hiccups. How could it have gotten this far?
As I watched, the system deteriorated to the point that I couldn't become root, and neither Mike nor I could log in remotely. I had to go to the colocation facility and log in at the machine directly. Off I went, and twenty minutes later I was at the colo. In those twenty minutes, and it did take that long, the delete sessions query finished, and the system was humming along again.
I drove back to the office puzzled.
And then it hit me.
The website receives a frequently scheduled hit to a page that triggers the database cleanup. That scheduled process had been sending me email on a daily basis, and I had been ignoring those emails, even though I shouldn't receive any when everything is running properly.
This was the problem. The scheduled process wasn't cleaning out the database tables as needed, I was ignoring the errors, and the sessions table grew to an unsustainable size for how frequently it's accessed.
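One way to make this kind of failure harder to ignore is to check the outcome (the size of the table) rather than rely on the cleanup job's own email. The sketch below is an illustration only: the table name, the 5000-row limit, and the function name are assumptions, and SQLite again stands in for the real database.

```python
import sqlite3


def check_sessions_table(conn, max_rows=5000):
    """Return (ok, message) depending on whether the table is within bounds.

    max_rows is an assumed limit, loosely based on "a few thousand".
    """
    count = conn.execute("SELECT COUNT(*) FROM sessions").fetchone()[0]
    if count > max_rows:
        return False, f"sessions table has {count} rows (limit {max_rows})"
    return True, f"sessions table has {count} rows"
```

A monitoring script could call this on a schedule and send a warning only when the first value is False, so that a message from it always means something needs attention.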
Lesson learned: listen to the small errors before they grow into big ones.