Highly available infrastructure for your own website
Every site is different, so this isn’t so much a tutorial as some things to watch out for. We’ll take a reasonably representative database-backed site and talk about what changes when we make it highly available.
For the purposes of demonstration we’ll use Magento, an e-commerce website written in PHP with a MySQL backend. As well as exemplifying the popular LAMP pattern, Magento allows for extensions that use extra software components, which also need to be taken into consideration in a highly available setup.
It’s worth noting that these notes apply even to vastly different systems. Taking some big customers that we’ve worked with as examples: GitHub is effectively a Rails app, and TestFlight’s core is mostly Django – the problem is approached in the same way.
Types of problems you’ll face
The approach we’re taking is to separate the moving parts and make each one highly available. This has the benefit of making the system more scalable in the process.
The parts we’re dealing with are:
- Webserver frontend
- Database backend
- Add-on components
- Load balancer, necessary for distributing traffic across the frontends
The webserver tier is generally the simplest to scale – just deploy more webservers and put them behind a load balancer. The catch here lies in keeping everything in sync, and sharing state between the servers.
Rolling out your codebase
Your site will need periodic updates for bugfixes and the occasional new feature. Content systems like Magento and WordPress generally have a one-click method to apply these, while something written in-house might use Capistrano, or something as simple as a subversion/git checkout from a repo.
When these occur, you’ll generally want to do some testing and allow for a clean changeover, and having a load balancer makes this very straightforward. To perform a change/upgrade on each frontend: manually remove it from the load balancer, apply the update, then reinsert it into the load balancer.
This way the end user never sees a half-ready server, and you can perform some testing before reinsertion if needed. If you like, you can extend this to do “blue-green deployments”.
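The remove-update-reinsert cycle above can be sketched as a small script. Everything here is illustrative: the hostnames are made up, and `lb_disable`/`lb_enable` are stand-ins for whatever admin interface your load balancer actually provides (an ldirectord config reload, an HAProxy admin-socket command, and so on).

```shell
#!/bin/sh
# Rolling update across a pool of frontends, one at a time.
# lb_disable/lb_enable/deploy are stubs -- replace them with your
# load balancer's admin commands and your real deployment step.

FRONTENDS="web1 web2 web3"

lb_disable() { echo "drained $1 from the load balancer"; }
lb_enable()  { echo "reinserted $1 into the load balancer"; }
deploy()     { echo "updated codebase on $1"; }

for host in $FRONTENDS; do
    lb_disable "$host"
    deploy "$host"
    # This is the point to smoke-test the updated frontend directly,
    # before it starts taking customer traffic again.
    lb_enable "$host"
done
```

Because only one frontend is out of the pool at any moment, capacity dips slightly during the rollout but the site as a whole stays up.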
Sharing session data
Magento, along with pretty much every substantial website, uses sessions to remember users and provide a continuous browsing/shopping experience. This data is commonly stored on the server in files, tied to a cookie that the client keeps.
This breaks if you start using multiple servers as requests will tend to be spread over all the servers, resulting in inconsistent state depending on which server the user happens to reach, and a broken experience overall. Some apps can store state in the client itself, but this tends to be inefficient as more data is transferred with every single request.
The solution to this is to share session data between all the servers. The most common approach is to store sessions in a shared database, in effect turning a “file problem” into a “database problem”, which we’ll deal with in the next section.
Memcached is a simple, high-performance and widely-used in-memory key-value store, commonly used for sharing session data between frontends. You run a single instance of memcached on the network and have all the frontends connect to it. The primary downside of memcached is that it’s purely in-memory – if your memcached server ever crashes, you lose all session data. It’s not the end of the world, but it makes for a poor experience if a customer is in the middle of a payment transaction.
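With PHP’s memcached extension installed, pointing every frontend at the shared instance is a two-line php.ini change. This is a sketch – the hostname and port are made up:

```ini
; php.ini on every frontend -- hostname/port are illustrative
session.save_handler = memcached
session.save_path = "sessions.example.com:11211"
```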
Specifically for Magento, there’s a pair of extensions for storing session and cache data in Redis: Cm_RedisSession and Cm_Cache_Backend_Redis. We love Redis because it behaves well and, unlike memcached, persists data to disk. That’s a win in our books, and is ideal for HA Magento.
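As a sketch, with Cm_RedisSession installed, pointing Magento 1 at a shared Redis instance is a small addition to app/etc/local.xml. The hostname and database number here are illustrative; check the extension’s README for the full set of options:

```xml
<config>
  <global>
    <!-- hand session storage to Cm_RedisSession; host/port are illustrative -->
    <session_save>db</session_save>
    <redis_session>
      <host>redis.example.com</host>
      <port>6379</port>
      <db>0</db>
    </redis_session>
  </global>
</config>
```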
An alternative offered by load balancers is “sticky sessions”, which ensures that a given client always hits the same frontend server. We’re not fans of shifting persistence to the load balancer as it doesn’t actually make for proper seamless HA, and can have problems scaling up. Sticky sessions will also need to be expired from the load balancer (it has a finite amount of memory), and you’ll run into mismatches with the website’s idea of sessions.
Generated files and user uploads
A related issue is handling uploaded user content, most commonly image files. These need to go to some form of shared storage, along with any thumbnails that the site is likely to generate. A shared filesystem, such as NFS, is an easy way to do this. Another option is a clustered filesystem such as GFS or OCFS2. However you do this, the shared storage also needs to be highly available.
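For the NFS route, each frontend simply mounts the shared export. A sketch of the fstab entry, with a made-up server name and paths:

```
# /etc/fstab on each frontend -- server and paths are illustrative
nfs.example.com:/srv/magento-media  /var/www/magento/media  nfs  hard  0  0
```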
As a final point, you’ll want to keep the webserver (apache, nginx, etc.) config consistent across all the servers. We use Puppet for config management and automation, which makes things super simple. If you’re not doing something similar, you’re going to have a bad time.
Database backend
So now you’ve got your frontends scaling out nicely and storing data in a database or shared filesystem. Now you need to make the storage layer highly available.
If you’ve been reading our HA articles, you’ll know that Corosync and Pacemaker are the way to go. Databases and filesystems are backed by DRBD storage, with a standby server ready to take over if the active server goes up in smoke.
This is our general formula for anything that needs on-disk storage; everything above the block device is just a service, which can be stopped and started on another node to effect a failover. This works great for NFS, MySQL, PostgreSQL and Redis, as well as more exotic things like AMQP servers.
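As an illustration of that formula, here’s roughly what the MySQL stack looks like in crm shell syntax: DRBD at the bottom, then the filesystem, a service IP and mysqld grouped on top of it. Resource names, device paths and the address are all made up, and your agent parameters will differ:

```
# crm configure sketch -- names, paths and addresses are illustrative
primitive p_drbd_mysql ocf:linbit:drbd \
    params drbd_resource="mysql" op monitor interval="15s"
ms ms_drbd_mysql p_drbd_mysql \
    meta master-max="1" clone-max="2" notify="true"
primitive p_fs_mysql ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/var/lib/mysql" fstype="ext4"
primitive p_ip_mysql ocf:heartbeat:IPaddr2 params ip="10.0.0.20"
primitive p_mysql ocf:heartbeat:mysql
group g_mysql p_fs_mysql p_ip_mysql p_mysql
colocation c_mysql_on_drbd inf: g_mysql ms_drbd_mysql:Master
order o_drbd_before_mysql inf: ms_drbd_mysql:promote g_mysql:start
```

A failover then just means Pacemaker promotes DRBD on the standby node and starts the group there – exactly the stop-and-start-elsewhere behaviour described above.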
Add-on components
The above sections cover almost all the problems you’re likely to run into. Even so, we went looking for other problems you might face, and one we found interesting is integration with frontend caches and CDNs.
One particular piece of software we’ve worked with is Varnish, a high-performance caching proxy designed to reduce the load on webservers, which tend to serve large amounts of static content. Caching can be hard to get right, especially so in a distributed environment. Care needs to be taken to ensure that content is correctly cached, without inadvertently leaking sessions between users.
Dealing with CDNs is more closely related to handling server-generated files. If you’re using a CDN extension for Magento, you’ll want to test that content is pushed to the CDN correctly, with particular attention paid to any versioning that the extension performs.
Load balancer
The load balancer isn’t too special on its own, beyond the necessary HA-ification. Our preference is to run a pair of load balancers, each one a virtual machine to allow for easy scaling, running ldirectord, with a Pacemaker-managed virtual IP for each service.
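A minimal ldirectord stanza for the web service might look like the following sketch. All addresses are made up, and `gate` selects LVS Direct Routing:

```
# ldirectord.cf sketch -- addresses are illustrative
virtual=192.0.2.10:80
        real=10.0.0.11:80 gate
        real=10.0.0.12:80 gate
        scheduler=wlc
        protocol=tcp
        service=http
        request="healthcheck.php"
        receive="OK"
        checktype=negotiate
```

The request/receive pair gives you an application-level health check, so a frontend that’s up but broken gets pulled from the pool automatically.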
There’s a particular caveat when it comes to dealing with load balancers and source IP addresses: if your load balancer performs NAT before forwarding traffic to the frontend servers, the application will see the load balancer’s address as the source instead of the client’s real address. This can be worked around by having the load balancer add an X-Forwarded-For header to the incoming request, but we prefer to use LVS’ Direct Routing mechanism and avoid the problem entirely.
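If you do take the X-Forwarded-For route and your frontends run nginx, the standard realip module can restore the client address, trusting the header only when it comes from the load balancers. The addresses here are illustrative:

```nginx
# nginx http{} context -- trust X-Forwarded-For only from the LBs
set_real_ip_from 10.0.0.5;    # load balancer A
set_real_ip_from 10.0.0.6;    # load balancer B
real_ip_header   X-Forwarded-For;
```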
There’s a lot to take into consideration if you’re planning to deploy a highly available website. Some sites can be patched up fairly easily, while for others it’ll be a lot of work. A well-written, well-architected site makes the job easier, but HA tends to come at the cost of added complexity, and can make the site harder to maintain in future.
Next time we’ll talk about some of the realities of an HA deployment. High availability is a great concept, but it doesn’t come for free, and in many cases it’s hard to argue that it’s worth it. What’s your website worth to you, and how much disruption would you really be willing to tolerate?