One of our dedicated server customers recently had a problem with the machine keeling over and dying for a few days in a row, for no apparent reason. This necessitated a remote reboot of the server to get it running again (we cut the power to both power supplies for a few seconds). The immediate suspicion was faulty hardware, but this should rarely be the case as we put our hardware through a thorough “burn in” period before it’s ever deployed. In addition, the crashes were happening pretty regularly in the middle of the day, which is hardly the signature of failing hardware.
After spotting this pattern, a quick look at our trending graphs showed us the problem very clearly. The machine was steadily using all available physical memory. Once physical memory runs out, the system starts pushing data off to swap space, which lives on the disks. Once that runs out too, the OS simply has no more memory to give: it starts killing off processes in a desperate attempt to reclaim some, but this is rarely successful.
Curiously, on days that the server didn’t die we saw a chunk of memory being released early in the morning. This pointed to a maintenance cronjob doing something, but what?
This should more accurately be described as a graph of non-critical memory allocation; buffers and caches can be quickly discarded to free up memory for applications that really need it. Note the sawtooth wave pattern with a period of 1 day.
Swapspace usage can be seen peaking multiple times over the course of a few days, corresponding to times when the server resembled a glorious fireball.
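That distinction between critical and non-critical memory is visible directly in /proc/meminfo on any Linux box (a quick diagnostic sketch; exact field names vary a little between kernel versions). MemFree counts truly unused pages, while Buffers and Cached are reclaimable on demand, so the memory that actually matters for avoiding a fireball is roughly MemFree plus Buffers plus Cached:

```shell
# Show how much memory is genuinely free versus merely "used" by
# reclaimable buffers and page cache, plus swap headroom (Linux).
grep -E '^(MemTotal|MemFree|Buffers|Cached|SwapTotal|SwapFree):' /proc/meminfo
```

When SwapFree starts trending towards zero while Buffers and Cached are already squeezed flat, the OOM killer is not far away.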
We were now well-prepared to catch the system in the act. We kept an eye on the trending graphs and waited for memory usage to get out of hand. When we logged in, what we saw was surprising: apache was consuming vast amounts of memory, highly irregular for a machine whose primary function is mail. Having a look at apache’s logs, we noticed a lot of failed requests to zpush, an open-source implementation of Microsoft’s Exchange ActiveSync protocol for syncing data to mobile clients (iPhones, in this case), written in PHP.
PHP Fatal error: Allowed memory size of 16777216 bytes exhausted (tried to allocate 727553 bytes) in /home/zpush/zpush/include/mimeDecode.php on line 292
PHP is usually a pretty sturdy beast. It sanitises the environment between requests, effectively sandboxing different users’ code from each other. It’s certainly possible to write pathologically broken code, but generally it shouldn’t badly affect anyone but yourself.
For whatever reason, zpush was exceeding the PHP memory limit in its backend MIME parser and dying. This is a condition that PHP should recover from gracefully, but that clearly wasn’t happening. With zpush being polled regularly by the customer’s iPhones, a little bit of memory leaked on every single request. Because apache wasn’t dying outright, that memory never got freed.
A whole lot of things suddenly made sense. Apache will eventually recycle its child processes (after 4,000 requests each), but on a lightly-used server like this one, apache was dealing almost exclusively with zpush requests. Spread over 8 child processes, apache could in theory take some 32,000 hits before it got close to saving itself. This also cleanly explained the early morning respite: logrotate signals apache to do a graceful restart after processing the day’s logs, which restarts the child processes, freeing the memory.
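The worst-case arithmetic is simple enough. With apache’s MaxRequestsPerChild set to 4000 and 8 children running (the figures from this server), every child has to burn through its full quota before being recycled, so the leak can accumulate across all of them:

```shell
# Worst case: every child serves its full request quota before being
# recycled, so the per-request leak accumulates across all of them.
children=8
max_requests_per_child=4000
echo "$((children * max_requests_per_child)) requests before every child is recycled"
```

Lowering MaxRequestsPerChild is one blunt way to cap how long a leaky child can live, at the cost of more frequent process churn.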
In this instance, the most straightforward solution was to disable zpush. If this weren’t an option then further investigation would be needed. Increasing the PHP memory limit beyond 16MB in php.ini would do the job, but we’d really like to know why it’s trying to allocate so much memory in the first place.
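As a sketch of that second option: bump memory_limit in php.ini, then gracefully restart apache so the new children pick up the setting. The path and the 64M figure below are illustrative only, not a recommendation, and we operate on a scratch copy rather than the live file:

```shell
# Raise PHP's per-request memory ceiling in a copy of php.ini.
# (In reality you would edit the live php.ini, then reload apache,
# e.g. with "apachectl graceful", so new children inherit the limit.)
ini=/tmp/php.ini.example
printf 'memory_limit = 16M\n' > "$ini"
sed -i 's/^memory_limit = .*/memory_limit = 64M/' "$ini"
cat "$ini"    # -> memory_limit = 64M
```

Note that a graceful restart is essential here: running children keep the old limit until they are recycled.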