Interesting failure modes, episode 2501
I got woken up by an SMS about low disk space the other night on one of our customers’ servers. Okay, so that’s a lie, I never sleep, but the SMS is real.
Oh great, they’re making whoopie on their mailing lists again and making some stupidly huge logfile.
Little did I know just how huge that file was. How about 735GB huge, in the space of 12 hours? This customer is already a bit of an oddball, what with 1.4TiB of usable space in their server. “Oh that’s nothing”, you say. Sure, I’ve got a few TiB of kitten pictures on my machine at home, just like you, but to put things in perspective: 300GiB of space would be “big” for most Anchor customers. Server-grade space isn’t cheap, either: SCSI disks cost about $1.70/GB, compared to about 10c/GB for SATA.
There was no mailout, no big processing job, and no flood of activity. With a little digging I was able to nail it down to an Apache error log. That was a surprise, except for the PHP errors all throughout – some things never change.
[Fri Oct 02 02:39:57 2009] [error]PHP Warning: fgets(): supplied argument is not a valid stream resource in /home/wright/public_html/script.php on line 15, referer: XXX
Nice work there, guys. You need to learn to check your return values from failure-prone functions.
Strangely, there were no actual active connections, but the process list showed two Apache processes going balls to the wall, writing the same error message to the log file ad infinitum. By my reckoning that was over 9000 lines per second – nothing a quick service restart couldn’t fix, thankfully.
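For the curious, the reckoning can be sketched in a few lines of Python. This is back-of-the-envelope only: it assumes 735GB means decimal gigabytes, that the file grew evenly over the full 12 hours, and that every line was as long as the redacted sample above (the real referer was longer than "XXX", so the true per-line figure would differ).

```python
# Rough write rate for the runaway error log.
# Assumptions (not measurements): 735 GB = 735 * 10**9 bytes, written evenly
# over 12 hours, every line the same length as the redacted sample line.

SAMPLE_LINE = (
    "[Fri Oct 02 02:39:57 2009] [error]PHP Warning: fgets(): supplied "
    "argument is not a valid stream resource in "
    "/home/wright/public_html/script.php on line 15, referer: XXX"
)


def write_rate(total_bytes, seconds):
    """Average bytes written per second."""
    return total_bytes / seconds


def lines_per_second(total_bytes, seconds, line_bytes):
    """Average log lines per second, given a fixed line length in bytes."""
    return write_rate(total_bytes, seconds) / line_bytes


total = 735 * 10**9                               # bytes in the log file
duration = 12 * 60 * 60                           # 12 hours in seconds
line_len = len(SAMPLE_LINE.encode("utf-8")) + 1   # +1 for the newline

print(f"{write_rate(total, duration) / 10**6:.1f} MB/s")
print(f"{lines_per_second(total, duration, line_len):,.0f} lines/s")
```

The byte rate works out to roughly 17 MB/s sustained, and under these assumptions the line rate lands comfortably over 9000.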
And to actually fix the problem? It’s tempting to dump the file, but we don’t like doing that; it’s just a bit too cowboy for us. I settled for a forced logrotate run, which took about 4 hours and squished it down to just 4.3GiB – Crisis (and sleep) Averted.
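Going from 735GB to 4.3GiB is a ratio on the order of 160 to 1, which sounds outrageous until you remember the file was one line repeated forever. Here is a minimal sketch of why that squishes so well, assuming the rotation compressed with gzip (I'm assuming the usual logrotate compressor here, not quoting our config). The synthetic log is hypothetical: the sample line with only the timestamp ticking over, at an assumed 9000 lines per second.

```python
import gzip
from datetime import datetime, timedelta

# The sample error line with the timestamp left as a placeholder.
TEMPLATE = (
    "[{stamp}] [error]PHP Warning: fgets(): supplied argument is not a valid "
    "stream resource in /home/wright/public_html/script.php on line 15, "
    "referer: XXX\n"
)


def synthetic_log(n_lines, lines_per_second=9000):
    """Build a fake error log in which only the timestamp ever changes."""
    start = datetime(2009, 10, 2, 2, 39, 57)
    lines = []
    for i in range(n_lines):
        when = start + timedelta(seconds=i // lines_per_second)
        lines.append(TEMPLATE.format(stamp=when.strftime("%a %b %d %H:%M:%S %Y")))
    return "".join(lines).encode("utf-8")


def compression_ratio(data):
    """Uncompressed size divided by gzip-compressed size."""
    return len(data) / len(gzip.compress(data))


log = synthetic_log(50_000)
print(f"{len(log):,} bytes raw, ratio {compression_ratio(log):.0f}:1")
```

A real log has more variety than this (different referers, the odd legitimate error), so expect a less flattering ratio in practice.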