We’ve recently found ourselves in the situation where we’re managing a highly-available storage service for a customer without actually having direct access to the data on the server. HA storage is commonplace for us now, but not having access to the data is unusual, particularly as a full-stack hosting provider.
The reason for this is that the server is presenting iSCSI targets, effectively networked block devices for the clients (“initiators” in iSCSI parlance). We don’t have access to the clients so we can’t find out what they’re doing. In short, there’s no easy access to the data, and it would probably be dangerous for us to try – we’d have to join their OCFS2 cluster to avoid corruption.
That doesn’t mean that we can’t monitor them though. As long as we give the monitoring node access to the iSCSI targets, we can read from the LUNs. That’s enough to test reachability without needing to climb further up the stack for deeper access – that’s the “blind” aspect of the check.
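The essence of the check is nothing more than a timed, direct read from each LUN. A minimal sketch (the device path is an example; a real check would loop over every LUN):

    #!/bin/bash
    # Blind read check: pull one block off an iSCSI LUN, Nagios-style.
    # The by-path name is an example – substitute your own target and LUN.
    LUN=/dev/disk/by-path/ip-192.0.2.10:3260-iscsi-iqn.2012-01.com.example:storage-lun-0

    # iflag=direct bypasses the page cache so we genuinely touch the target,
    # and timeout(1) stops a dead target from hanging the check forever.
    if timeout 10 dd if="$LUN" of=/dev/null bs=4096 count=1 iflag=direct >/dev/null 2>&1; then
        echo "OK: read 4KiB from $LUN"
        exit 0
    else
        echo "CRITICAL: could not read from $LUN"
        exit 2
    fi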
We’ve posted the code in a GitHub repo in the hope that it might be useful to others. It’s pretty basic stuff, but it might be just the thing if you’re in a similar situation or aren’t getting enough love from more heavyweight checks that tend to use SNMP.
It’s important to note that this isn’t a standalone check, it’s part of a comprehensive monitoring and trending suite that we deploy. The servers providing the high-availability iSCSI targets are closely watched to ensure that we’re aware of any low-level issues, while this check provides assurance of correct behaviour at a higher level, as far as the handoff point to the clients.
We’re rounding out our series on high availability with a little discussion on the benefits of HA versus the inherent costs. If you’ve been keeping up with the previous articles you’ve probably gotten the impression that it’s a lot of work and easy to get wrong; you’d be correct.
That said, HA definitely has its place, so it’s worth arming yourself with the knowledge to assess when it’s appropriate.
HA systems have obvious financial costs, but there’s a lot more to it than just money. We’ll talk about these first because we think it’s important to have these in mind when you assess the benefits of going down the HA path.
The pattern you’ll notice is that almost everything boils down to complexity. Complexity is your enemy when building reliable systems, but it’s a necessary evil for HA.
Right off the bat you’ll be paying more for your hosting. The exact costs depend on the complexity of your deployment, but we generally find you’re looking at doubling or tripling your costs. In the simplest case you’ll be moving from a single server to a pair of servers. A website will probably need to make use of a load balancer, which is an extra expense.
A highly available solution also takes extra time to set up, which can be a substantial cost for a large deployment.
HA systems tend not to be as agile as their Ordinary Availability brethren. Changes need to be carefully considered, and necessarily take longer when there are more systems to deal with.
As an example, consider a disk upgrade for a database server. On a non-HA server this can be done easily (and without downtime) if LVM is used and drives can be hot-added. A DRBD cluster will require the same steps, then a manual online extension and resync of the DRBD volume. It’s not terribly difficult, but it’s not simple enough to put into a script that can be blindly run in a matter of seconds.
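For the curious, the DRBD leg of that upgrade looks roughly like this – a sketch assuming an LVM-backed DRBD resource named r0 on /dev/drbd0 (names and sizes are examples):

    # grow the backing logical volume on BOTH nodes first
    lvextend -L +200G /dev/vg0/drbd-mysql
    # then, on the primary node, tell DRBD to take up the new space
    # (this triggers a resync of the newly added area)
    drbdadm resize r0
    # finally, grow the filesystem online
    resize2fs /dev/drbd0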
HA systems are by no means maintenance-free, and many common operations need special attention when using HA. You need sysadmins available that are skilled in managing clusters safely, and this extends to out-of-hours periods as well in case something goes wrong.
Training and changes to internal processes
Anyone who regularly works on your website will have to modify their workflow to handle the HA deployment. This isn’t too onerous if you’re already using good development practices, but it requires a lot of discipline if you’re used to rolling out changes in an ad-hoc manner. Use of automation tools, such as Capistrano and Fabric, is practically mandatory to ensure smooth running.
The benefits of running a highly available environment are straightforward to describe, but actually quantifying them is another matter. This varies heavily on a case-by-case basis, so you’ll need to figure out the numbers for yourself and see where things land.
When done right, you can expect a substantial improvement in SLA terms for an HA system (though it is actually possible for an HA system to cause more downtime than it prevents, due to the added complexity).
It’s important to know the limits of your system, and just what sort of failure you’re trying to mitigate. In most cases you’re protecting against the failure of a single server chassis. Even assuming 100% uptime from a cluster of servers, your datacentre probably only guarantees 99.982% for power feeds (a Tier-3 datacentre). These are worst-case figures though, and perfect uptime on a month-to-month basis is quite normal.
Time to recovery in the event of a failure
This is an easily overlooked benefit that’s really more important than straight uptime. A single (non-HA) server is likely to give many years of reliable service without any problems, and many issues can be avoided by using redundant components such as hotswap power supplies and hard drives, ECC memory, etc. When the server eventually fails, there will be a significant interruption to services as the chassis is replaced and configured.
This is where a highly available system really shines. In the same way that RAID gives you a safety net when a drive fails, an HA cluster will keep going when a server fails. There may be a brief interruption as services are migrated to a working node, but downtime is measured in seconds or minutes, instead of hours.
By virtue of the fact that HA clusters are designed to tolerate servers going offline, it’s possible to perform most maintenance operations without downtime, unlike regular systems. This is especially convenient when it comes to hardware changes, as each server can be taken offline and upgraded or replaced at leisure.
We suspect that in many cases, the question is a financial one. It’s an unenviable position to be in if you have to justify a lengthy outage by saying that lost sales or productivity is cheaper than having a highly available environment. That’s ultimately what it boils down to though, and you’ll need to find those hard numbers for yourself. We’re always happy to advise on the technical side of things though.
This completes our series of articles on high availability; we hope you’ve found them engaging and enlightening. We wouldn’t presume that we’ve addressed everything, so if you’ve got any questions feel free to let us know.
Fancy yourself a master when it comes to high availability systems? We’re hiring.
Every site is different, so this isn’t so much a tutorial as some things to watch out for. We’ll take a reasonably representative database-backed site and talk about what changes when we make it highly available.
For the purposes of demonstration we’ll use Magento, an e-commerce website written in PHP with a MySQL backend. As well as exemplifying the popular LAMP pattern, Magento allows for extensions that use extra software components, which also need to be taken into consideration in a highly available setup.
It’s worth noting that these notes apply even to vastly different systems. Taking some big customers that we’ve worked on as examples, GitHub is effectively a Rails app and TestFlight’s core is mostly Django – the problem is approached in the same way.
Types of problems you’ll face
The approach we’re taking is to separate the moving parts and make each one highly available. This has the benefit of making the system more scalable in the process.
The parts we’re dealing with are:
The webserver tier that runs the application code
The storage tier – the database, session data and shared files
The load balancer, necessary for running the frontends
The webserver tier is generally the simplest to scale – just deploy more webservers and put them behind a load balancer. The catch here lies in keeping everything in sync, and sharing state between the servers.
Rolling out your codebase
Your site will need periodic updates for bugfixes and the occasional new feature; this is true of any site. Content systems like Magento and WordPress generally have a one-click method to apply these, while something written in-house might use Capistrano, or something as simple as a subversion/git checkout from a repo.
When these occur, you’ll generally want to do some testing and allow for a clean changeover, and having a load balancer makes this very straightforward. To perform a change/upgrade on each frontend: manually remove it from the load balancer, apply the update, then reinsert it into the load balancer.
This way the end user never sees a half-ready server, and you can perform some testing before reinsertion if needed. If you like, you can extend this to do “blue-green deployments”.
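One low-tech way to do the remove/reinsert dance with an LVS/ldirectord setup (which is what we use – more on that later) is to make the balancer’s health check fail on purpose, so it drains the frontend for you. A sketch, assuming the check fetches healthcheck.html from the docroot:

    # on frontend1: fail the load balancer's health check so it pulls this server
    mv /var/www/html/healthcheck.html /var/www/html/healthcheck.html.out
    # ... wait for connections to drain, apply the update, run your smoke tests ...
    # put it back into rotation
    mv /var/www/html/healthcheck.html.out /var/www/html/healthcheck.html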
Magento, along with pretty much every substantial website, uses sessions to remember users and provide a continuous browsing/shopping experience. This data is commonly stored on the server in files, tied to a cookie that the client keeps.
This breaks if you start using multiple servers as requests will tend to be spread over all the servers, resulting in inconsistent state depending on which server the user happens to reach, and a broken experience overall. Some apps can store state in the client itself, but this tends to be inefficient as more data is transferred with every single request.
The solution to this is to share session data between all the servers. The most common approach is to store sessions in a shared database, in effect turning a “file problem” into a “database problem”, which we’ll deal with in the next section.
Memcached is a simple, high-performance and widely-used in-memory key-value store, commonly used for sharing session data between frontends. You run a single instance of memcached on the network and have all the frontends connect to it. The primary downside of memcached is that it’s purely in-memory – if your memcached server ever crashes, you lose all session data. It’s not the end of the world, but it makes for a poor experience if a customer is in the middle of a payment transaction.
Specifically for Magento, we’ve found a pair of extensions for storing this data in Redis: Cm_RedisSession and Cm_Cache_Backend_Redis. We love Redis because it behaves well and stores data to disk, unlike memcached. That’s a win in our books, and is ideal for HA Magento.
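Hooking Magento up to Cm_RedisSession amounts to a block of XML under the global section of app/etc/local.xml. An abridged sketch, with example host and port:

    <session_save><![CDATA[db]]></session_save>
    <redis_session>
        <host>192.168.0.50</host>   <!-- floating HA IP of the Redis server (example) -->
        <port>6379</port>
        <db>0</db>
        <timeout>2.5</timeout>
    </redis_session>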
An alternative offered by load balancers is “sticky sessions”, which ensures that a given client always hits the same frontend server. We’re not fans of shifting persistence to the load balancer as it doesn’t actually make for proper seamless HA, and can have problems scaling up. Sticky sessions will also need to be expired from the load balancer (it has a finite amount of memory), and you’ll run into mismatches with the website’s idea of sessions.
Generated files and user uploads
A related issue is handling uploaded user content, most commonly image files. These need to go to some form of shared storage, along with any thumbnails that the site is likely to generate. A shared filesystem, such as NFS, is an easy way to do this. Another option is a clustered filesystem such as GFS or OCFS2. However you do this, the shared storage also needs to be highly available.
As a final point, you’ll want to keep the webserver (apache, nginx, etc.) config consistent across all the servers. We use Puppet for config management and automation, which makes things super simple. If you’re not doing something similar, you’re going to have a bad time.
So now you’ve got your frontends scaling out nicely and storing data in a database or shared filesystem. Now you need to make the storage layer highly available.
If you’ve been reading our HA articles, you’ll know that Corosync and Pacemaker are the way to go. Databases and filesystems are backed by DRBD storage, with a standby server ready to take over if the active server goes up in smoke.
This is our general formula for anything that needs on-disk storage; everything above the block device is just a service, which can be stopped and started on another node to effect a failover. This works great for NFS, MySQL, PostgreSQL and Redis, as well as more exotic things like AMQP servers.
The above sections cover almost all the problems you’re likely to run into. Even so, we went hunting for other problems you might face. Something we thought was interesting was integration with frontend caches and CDNs.
One particular piece of software we’ve worked with is Varnish, a high-performance caching proxy designed to reduce the load on webservers, which tend to serve large amounts of static content. Caching can be hard to get right, especially so in a distributed environment. Care needs to be taken to ensure that content is correctly cached, without inadvertently leaking sessions between users.
Dealing with CDNs is more closely related to handling server-generated files. If you’re using a CDN extension for Magento, you’ll want to test that things behave properly when it comes to pushing content to the CDNs, with particular attention paid to any versioning that the extension performs.
The load balancer isn’t too special on its own, beyond the necessary HA-ification. Our preference is to run a pair of load balancers, each one a virtual machine to allow for easy scaling, with ldirectord on each, and a Pacemaker-managed virtual IP for each service.
There’s a particular caveat when it comes to dealing with load balancers and source IP addresses: if your load balancer performs NAT before forwarding traffic to the frontend servers, the application will see the load balancer as the source address instead of the client’s real address. This can be worked around if the load balancer adds an X-Forwarded-For header to the incoming request, but we prefer to use LVS’ Direct Routing mechanism and avoid the problem entirely.
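If you do end up with the NAT arrangement, the frontends can be taught to trust the header. A sketch for nginx using its realip module (the addresses are examples – list only your load balancers):

    # trust X-Forwarded-For only when it comes from our load balancers
    set_real_ip_from 192.168.0.1;
    set_real_ip_from 192.168.0.2;
    real_ip_header   X-Forwarded-For;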
There’s a lot to take into consideration if you’re planning to deploy a highly available website. Some sites can be patched up fairly easily, while for others it’ll be a lot of work. A well-written and architected site can make this easier, but it tends to come at the cost of added complexity, and can be harder to maintain in future.
Next time we’ll talk about some of the realities of an HA deployment. High availability is a great concept, but it doesn’t come for free, and in many cases it’s hard to argue that it’s worth it. What’s your website worth to you, and how much disruption would you really be willing to tolerate?
Interested in building state-of-the-art high availability websites and infrastructure? We’re hiring.
The HA binge continues: today we’re talking about high availability through clustering – providing a service with multiple, independent servers. This differs from the options we’ve discussed so far because it doesn’t involve Corosync and Pacemaker.
We’ll still be using the term “clustering”, but it’s now applied high up at the application level. There’s no shared resources within the cluster, and the software on each node is independent of other nodes.
A brief description
For this article we’re talking exclusively about naive applications that aren’t designed for clustering – they’re unaware of other nodes, and use an independent load-balancer to distribute incoming requests. There are applications with clustering-awareness built in, but they’re targeted at a specific task and aren’t generally applicable, so they’re not worth discussing here.
Comparison with highly available resources
Using Corosync and Pacemaker requires tight communication between nodes for proper functioning. In comparison, a clustered application need not have any communication between nodes.
This works on the assumption that incoming requests are stateless and independent. Multiple requests from the same client are likely to be processed by different nodes, and client-specific data should not be stored on cluster nodes as it won’t be shared around.
As such, application clustering is best suited to services that are short-running and/or stateless in nature. Good examples include web, FTP and VPN servers.
While conceptually simple, it’s a reasonably complex solution that needs careful attention to ensure that the assumptions of statelessness and independence hold true. In return, application clustering deals with failure very gracefully and needs minimal amounts of maintenance compared to non-clustered systems.
How it works
The above diagrams explain application clustering well enough, though there are a few caveats that can’t be explained graphically.
The load-balancing mechanism is deliberately unspecified. It could be as simple as using round-robin DNS, though a “smarter” solution is usually used.
The load balancer needs to be smart enough to detect failed servers and remove them from the pool. If not done, a proportion of client requests will keep failing.
It’s for this reason that DNS round robin is not recommended, as there’s no simple way to remove a failed host.
The load-balancer is now a single point of failure, and thus also needs some form of HA protection. Anchor’s deployments make use of ldirectord/IPVS for load-balancing, managed by Corosync and Pacemaker to ensure high availability.
How failure is handled
Because the servers are all live, all the time, graceful handling of failures relies on the load balancer detecting a failed node, and removing it from the pool of servers. The load balancer detects failure by polling each server periodically, and checking for a healthy response.
In practice, some requests will hit a failing server before it’s removed from the pool. This is generally acceptable, and can be tuned by adjusting the load balancer’s failure thresholds.
As an example, we might poll every 5sec, require a healthy reply within 1sec, and drop the server from the pool if it returns 3 consecutive failures.
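In ldirectord terms, that policy plus the pool definition come out something like this (a sketch of ldirectord.cf – addresses and the check page are examples):

    checkinterval = 5    # poll every 5 seconds
    checktimeout  = 1    # a healthy reply must arrive within 1 second
    failurecount  = 3    # drop a real server after 3 consecutive failures

    virtual = 203.0.113.10:80
        real = 192.168.0.11:80 gate
        real = 192.168.0.12:80 gate
        service = http
        request = "healthcheck.html"
        receive = "OK"
        checktype = negotiate
        protocol = tcp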
Suitability and summary
As part of a highly available architecture, application level clustering is right at home with webservers and other frontend services, typically backing onto HA databases and fileservers. As a bonus, application clustering also increases capacity in a roughly linear fashion as more servers are added, giving optimal utilisation of resources.
Most web applications can be clustered with some modifications, but there are a few pitfalls to watch out for. They mostly revolve around session handling and deployment, which we’ll cover in the next instalment.
Redis has become one of the most popular “noSQL” datastores in recent times, and for good reason. Customers love it because it’s fast and fills a niche, and we love it because it’s well behaved and easy to manage.
In case you’re not familiar with Redis, it’s a key-value datastore (not a database in the classic sense). The entire dataset is always kept in memory, so it’s stupendously fast. Durability (saving the data to disk) is optional. Data in Redis is minimally structured; there’s a small set of data types, but there’s no schema as in a traditional relational database. Thanks to some peculiarities in the way Redis is implemented, it can offer atomic transactions that are difficult to achieve in normal database products.
That’s not to say it’s perfect though. One of our larger customers uses Redis extensively and we’ve run into some limitations that just aren’t cool when you’re trying to juggle hundreds of gigabytes of data without dropping any of it.
Redis is still the best software out there for what our customers are doing, so we set about making it work for us.
We want to have our cake and eat it too – that means we need data to be safely on disk while enjoying the blinding speed that Redis provides. Let’s talk about data persistence when it comes to Redis.
The default Redis config will periodically perform an “RDB dump”, a full copy of the dataset at a point in time, straight to disk. This is great for small amounts of data, and convenient for quick restarts. However, because it’s a point-in-time snapshot of the dataset, you’ll lose any subsequent changes in the event of a crash.
The alternative is AOF logging (“Append-Only File”), which appends every write command to an ever-growing logfile. In the event of a crash, Redis replays the whole log from scratch to recreate the dataset. This means you don’t lose any data if the server crashes, but the replay is slow and the AOF file will keep getting bigger unless you prune it periodically.
Both RDB and AOF persistence have their uses, and you can use both at the same time. That’s well and good, but there are a few specific things that we care about:
No data loss in the event of a crash
Fast startup (important for high availability)
Solid manageability from a high-availability standpoint
We set about solving these problems, one by one. The first one, at least, is solved by making use of AOF persistence.
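In redis.conf terms, that first requirement amounts to a few lines (a sketch; tune the fsync policy to taste):

    appendonly yes         # enable the AOF
    appendfsync everysec   # fsync once a second – up to 1s of writes at risk; use "always" for strict durability
    save 900 1             # keep occasional RDB snapshots too, for fast restarts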
Dynamic listen address specification
One of the first improvements we made was allowing a graceful failover to a replication slave. Redis supports replication out of the box, and it works great, but you can’t readily promote a slave into the master’s place. This is because the IP address that Redis listens on is set at startup, forcing you to put some sort of load balancer or proxy in front. That’s really unnecessary when all you want to do is fail over a single Redis instance.
So we fixed it: you can now change the listening IP address for a Redis instance, allowing a failover or migration with a couple of seconds of downtime. The alternative is to restart the daemon, which takes a long time when you have a large dataset.
Backup dumps to an external command pipe
Regular RDB backups are convenient because they’re dumped straight to a self-contained file on disk. That’s fine for standard nightly backups, but there’s a bit more hassle in getting them off the machine if you’re concerned about quick recovery when the server goes up in smoke. You also incur a heavy I/O penalty as everything is written to disk, only to be pulled off to a remote server a short time later.
We developed the PIPESAVE command to work around this; instead of dumping to disk, the Redis server runs a command specified in the config file and shovels bits down the pipe. This is great for offsite backups to Amazon S3, etc.
In-band RDB dumps to the client
PIPESAVE is pretty cool, but it’s a “push”-based backup solution. Real backup systems prefer to “pull” data from the server, so they can manage scheduling themselves. A backup system would ideally fetch an RDB dump straight from the Redis server, across the network, without the need for an intermediary file.
That’s what our DUMPSAVE command does. After making an ordinary Redis connection, the Redis client can request a DUMPSAVE using the standard Redis protocol. The Redis server will then switch to a raw, non-protocol mode and pump an RDB dump straight down the wire to the client, which can be pushed directly to disk.
We’ve found that this makes for a fantastic backup solution that can be driven by the backup server, on whatever schedule we find suitable. You can find both the pipesave and dumpsave enhancements in one of our topic branches on GitHub.
AOF background rewrite monitoring
AOF files will grow forever, but thankfully Redis comes with functionality to consolidate an AOF file into a minimal set of commands needed to reconstitute the dataset. Redis does this in a background rewrite process to avoid holding up the master process.
This is very good, but there’s no way to know when a rewrite was last performed. The Redis server notes when the last RDB dump was performed, but there’s no equivalent for AOF rewrites. Until now, that is. Our Nagios monitoring checks the time of the last AOF rewrite and will notify us if it’s been too long since the last clean rewrite.
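The check itself is trivial once the timestamp is exposed. A sketch – note that the aof_last_rewrite_time field name is hypothetical, standing in for whatever a patched INFO output provides:

    #!/bin/bash
    # Warn if the last clean AOF rewrite was more than a day ago.
    MAX_AGE=86400
    # "aof_last_rewrite_time" is a placeholder field name, not stock Redis
    LAST=$(redis-cli -h 192.168.0.50 INFO | tr -d '\r' | awk -F: '$1 == "aof_last_rewrite_time" {print $2}')
    NOW=$(date +%s)
    if [ -z "$LAST" ] || [ $((NOW - LAST)) -gt $MAX_AGE ]; then
        echo "WARNING: last clean AOF rewrite was too long ago"
        exit 1
    fi
    echo "OK: AOF rewritten $((NOW - LAST))s ago"
    exit 0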
Truncate short AOF writes
It’s possible for an AOF write to bomb out and not write a full record to the end of the file (eg. if the disk is full). The official docs helpfully mention that the redis-check-aof command can fix this for you, but that takes a long time when your AOF file is massive, and really isn’t fun when you can’t restart Redis because of a few lousy bytes at the end of the file.
This really shouldn’t be necessary, so we fixed it. The background rewrite will now try to clean up after itself if it gets in a pickle.
Retry writes to the AOF if we run out of space
AOF files use a lot of diskspace. That’s okay because disk is cheap. However, a background rewrite of an AOF file can cause a huge spike in disk usage, and it’s possible for that to get out of hand before anyone notices.
In the event that a background rewrite fails due to running out of diskspace, Redis will crash hard. There’s a certain cruel irony in the fact that the background rewrite process, designed to stop you from running out of diskspace due to an ever-growing AOF file, itself causes an out-of-space condition and kills the Redis server.
We’ve fixed this by making the usual AOF logger a bit more relaxed. If we run out of diskspace, the background rewrite process will eventually die, but the main AOF logger will retry the write in the hope that it might eventually succeed.
Redis is a very nifty piece of software. And, thanks to it being well written, we’ve been able to extend and improve the functionality to meet our needs and make it more robust.
With these improvements, we now suffer less downtime in the event of an outage, have less chance of an outage to begin with, get faster failovers on HA systems, and enjoy backups that are timely and put minimal extra load on the Redis server. That’s pretty cool.
The caveats are what we’ll deal with today. Sometimes you’re dealing with software that won’t play nice when moved between systems, like a certain Enterprise™ database solution. Sometimes you can’t feasibly decompose an existing system into neat resource tiers to HA-ify it. And sometimes, you just want HA virtual machines! This can be done.
If the solution to our problem is to run everything on a single server, so be it. We then virtualise that server, and make it highly available.
Once again, it’s important to remember that we’re guarding against a physical server going up in smoke. There’s no magic scalability here, and ideally the HA subsystem never actually does anything, except when there’s a major problem.
As per our standard setup, we’re using KVM for virtualisation as it’s been mainlined into the Linux kernel.
A highly-available VM is really simple, comprising just two Pacemaker resources:
DRBD for replicated storage
Running of the VM itself
After this, everything else is pretty standard – DRBD needs to start before the VM, and stop after it. The VM management is a new type of Pacemaker resource, and that’s it. The start/stop/monitor actions in the resource agent script call out to the libvirt library, and let it handle the hard work.
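One suitable, widely-shipped resource agent for this is ocf:heartbeat:VirtualDomain. A sketch of the pair of resources in crm shell syntax – names, paths and timeouts are examples:

    primitive p_drbd_vm ocf:linbit:drbd params drbd_resource=vm0 op monitor interval=20s
    ms ms_drbd_vm p_drbd_vm meta master-max=1 clone-max=2 notify=true
    primitive p_vm ocf:heartbeat:VirtualDomain \
        params config=/etc/libvirt/qemu/dbvm.xml hypervisor=qemu:///system \
        op monitor interval=30s timeout=60s op start timeout=120s op stop timeout=180s
    # the VM may only run where its DRBD storage is Primary, and only after promotion
    colocation c_vm_on_drbd inf: p_vm ms_drbd_vm:Master
    order o_drbd_before_vm inf: ms_drbd_vm:promote p_vm:start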
How failure is handled
At this point, the VM is an opaque black box. As long as libvirt reports that the VM is running, Pacemaker won’t do anything. This is good because it means you can treat the VM as a normal machine; apply all your usual monitoring for the Enterprise database app, and kick it as usual when it breaks. A BSoD or kernel panic is nothing special either: the VM is still running.
The failure case that we do care about is if one of the KVM hosts stops working. If the VM monitor action times out or the host stops responding, the standby node in the clustered pair will notice, possibly STONITH the bad node, and take over the running of your VM.
It’s important to know what this means for your VM: Pacemaker will attempt to cleanly shut down the VM, then yank the virtual power cord if that fails. This means that when the VM comes up on the standby node it will have to deal with an unclean shutdown, which can take a long time if a fsck/chkdsk is needed. HA cannot help you in this scenario!
Things to note
Pacemaker adds an extra layer of fun if you forget (or don’t know) that it’s keeping an eye on things: it’ll keep restarting a VM that you’re trying to shut down, unless you tell it to stop managing it. This doesn’t happen on an ordinary “services deployment” because Pacemaker will hand off resources when it shuts down. Watching an unwitting sysadmin deal with this is like playing with a roly-poly toy.
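The incantation to remember before planned work (the resource name is an example):

    # hands off, Pacemaker – we're working on this VM
    crm resource unmanage p_vm
    # ... do your maintenance, shut it down, whatever ...
    crm resource manage p_vm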
Summary and evaluation
While not without limitations, HA for whole VMs can be very convenient. At its extreme, it allows you to offer high availability for servers that you don’t even have login access for.
One catch is that it can be expensive – each KVM host needs enough RAM and diskspace to run all of the VMs in the event of a failure. If you have many VMs on an HA pair there’s a lot of unused computing capacity, which tends to have a large capital cost upfront.
The Linux HA suite offers robust solutions, but doesn’t always ensure the best utilisation of resources. Next time we’ll start talking about high-availability through load-balancing and redundancy, which can be a very nice way to get the scalability and availability that you need if you’re willing to make substantial changes to your application architecture.
Now that we’ve got our terminology sorted out, we can talk about real deployments. Our most common HA deployments use the Linux HA suite, with multiple services managed by Pacemaker. This is roughly the “stack” that we referred to in the first post in the series.
We’ve already covered the resources involved, so we’ll focus on the important bit: What happens when something goes wrong?
Recall that on our hypothetical HA database server, we’ve got the following managed resources:
Floating IP address for the service
The DB service itself
Each resource has its own monitor action, specified by the Resource Agent (RA). Roughly speaking, an RA is a script that implements a common interface between Pacemaker and the resources it can manage. It looks a lot like an initscript, but is more rigorously defined.
The monitor action is straightforward: Pacemaker runs it regularly (20sec is a normal interval), and it either says the resource is running fine, says it’s not running, or times out. So long as Pacemaker keeps hearing good news, nothing exciting happens.
Before we go too much further, let’s quickly discuss what “monitor” means.
Monitoring cluster resources
Each resource needs some sort of monitoring to be useful. Pacemaker doesn’t care how it works, so long as it happens. “Success” in the monitor action means:
For a DRBD device we check that the kernel module is loaded, and that the local node is in either the DRBD Primary or Secondary role
A filesystem must be mounted (check /proc/mounts). We can optionally also check that the filesystem is writeable – see the sketch after this list
An IP address is bound to an interface
A database must answer a basic SELECT query over a standard client connection
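As a concrete example, the optional write test for a filesystem is selected by the monitor’s check level. A sketch in crm shell syntax, with illustrative values:

    primitive p_fs ocf:heartbeat:Filesystem \
        params device=/dev/drbd0 directory=/var/lib/pgsql fstype=ext4 \
        op monitor interval=20s timeout=40s OCF_CHECK_LEVEL=20
    # OCF_CHECK_LEVEL=20 writes (and removes) a status file on each monitor run;
    # level 10 merely reads, and the default just checks the mount.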
Monitoring is pretty straightforward, but it’s important (and sometimes difficult) to write monitoring actions that accurately reflect the state of the resource, without depending on correct functionality of an unrelated component.
An example of this would be a network fault causing problems for an NFS mount, which affects your ability to read the status of a local (Pacemaker-managed) filesystem.
Recovering from monitoring failures
So what happens if a monitor action fails? If the resource isn’t running, Pacemaker will try to run the start action to bring it up. If the monitor action times out, it will try to cleanly stop the resource and then start it again, possibly on the other cluster node.
This makes for a resilient system that tries to repair itself in the face of failure. Things get more interesting if recovery also fails, and that’s where STONITH steps in.
When recovery fails
All of the stop/start/monitor actions have timeouts built in, and Pacemaker will attempt to handle a timeout condition as well. We’ve already seen that a monitor timeout translates to a stop and start. A timeout on start isn’t a big deal – we can try to start it again. A failure or timeout on stop is considered critical.
A broken resource that can’t be stopped gracefully needs to be taken by force. We’ve already covered this pretty well in the first article so we won’t dwell on it, but it suffices to say that you’ll incur a bit of downtime as the cluster sorts things out and brings services up again.
Summary and evaluation
The deployment we’ve described is reliable and well-behaved. Because each resource is self-contained and independent, any problems are usually straightforward to diagnose and repair.
We’ve found that most services can be decomposed into a similar stack of resources – in the end it’s just a daemon being started-up on a server, and sometimes the server it runs on changes.
There are some services that don’t play nicely this way though, and sometimes you want to manage something bigger, like a VM. We’ll cover this in our next post on high availability deployments.
Following our previous post on the basics of high-availability services, it occurred to us that there’s often some confusion about the use of certain terms and phrases. We’d like to clear that up before pressing on, and hopefully reduce some of the headaches for people in the long run.
We’re dealing with a few closely related terms here, with important differences in meaning:
High Availability (HA) is a concept and a goal. How you achieve it is up to you, but the implication is that it involves more than one server, because a single server is a single point of failure.
Having a hot-standby server to take over in the event of a failure is one way to get HA. For certain types of services this is the most appropriate method, and Anchor uses the Linux HA software to do this.
Another option is to run a team of identical servers, all serving requests, with the intention that if some of them fail, the others will keep going. This is generally called load-balancing, and is best suited to things like web frontends serving http/https requests.
Load Balancing refers to having a pool of two or more servers serving requests for clients. Each server is identical as far as clients are concerned, it doesn’t matter which server in the pool answers the request.
Load balancing is a popular HA technique because it’s also scalable – capacity can be increased roughly-linearly by adding more servers to the pool. Failures are handled gracefully by dropping the server from the pool.
Load balancing isn’t a free lunch though. The load balancer itself, which sits in front of the servers and distributes incoming requests evenly, is a critical single point of failure. Now you need HA for your load balancer, and you might also hit a performance wall if you see enough traffic.
It’s also important to keep in mind that a given client won’t always be served by the same server in the pool. This breaks some common assumptions about how things work (eg. session state), so a degree of care is needed when using load balancing, and some services are difficult if not impossible to load balance (eg. databases).
The Linux HA project is a suite of software components used for building high availability systems, and is considered to be The HA solution on Linux. The primary components are Corosync (the cluster messaging layer) and Pacemaker (the cluster resource manager).
Linux HA manages “resources” on a cluster of servers. If a resource stops working or the server hosting a resource dies, it’s failed over to another server in the cluster to keep it running. Linux HA doesn’t perform any load balancing; there are other tools, such as ldirectord, for that purpose.
Corosync and Pacemaker are powerful tools, but also very complex. The learning curve to get started is steep, and maintenance requires a fine hand to avoid shooting yourself in the foot. Properly implemented, a Linux HA cluster is very reliable and delivers excellent uptime. If that’s what you’re after, why not employ experts, like yours truly here at Anchor?
That’s it for this instalment, feel free to leave a comment if anything isn’t clear or needs elaboration.
Next time we’ll talk about different options for deploying Linux HA clusters, and what’s suitable in various situations.
In what we plan to be a small series of articles about our high availability deployments, we thought we’d start by defining the key components in the stack and how they work together.
In future we’ll cover some of the more specific details and things that need to be taken into consideration when deploying such a system. For now we’ll talk about the bits that we use, and why we use them.
Type of deployment
A highly available system is also highly complex, so it’s important to know just what problem you’re trying to solve when you take on that burden.
Our systems are designed to deal with the total failure of a server chassis. This is very low-level and was chosen because it provides the greatest flexibility when dealing with various software stacks.
To be clear, this is not at all like a clustered application, which is written to run on multiple servers at once. In our setup the active server can fail, and the standby server will step in to take the load.
A high-availability deployment can be as large and complex as you want, but we like to keep things simple. Some nomenclature:
We’ll only talk about two-server deployments, which covers almost every system we manage
In a two-server setup, one is “active” and the other is “standby”
Together, these form an “HA cluster”
Each server in the cluster is a “node”
Because we’re dealing with whole-machine failure scenarios, we use servers with identical specifications to build the cluster.
Each chassis must be powerful enough to shoulder the full load on its own, as there’s no expectation to share the load within the cluster.
Now that we’ve got the basics out of the way, we’ll present a fairly common use case for such a setup: a highly-available PostgreSQL database server.
Note that we’re not trying to use replication here – that’s for solving a different problem. Replication could be used to effect the same outcome in this scenario, but it introduces a different sort of complexity and more work to repair things when the active server fails.
Total hardware failures aren’t terribly common. The point of HA here is to mitigate the risk of extended downtime if things go bad, and squeeze out an improved uptime figure. As a bonus, routine maintenance can be carried out on the cluster servers with minimal disruption to services.
At its most basic, running a cluster is a matter of ensuring all the members are talking to each other, on the same page, and then sending messages to negotiate who should be running a particular cluster-managed service.
Corosync is the messaging layer of the cluster, effectively holding everything together. It handles membership of the cluster and ensures that problems are detected very quickly. This information is communicated up the stack to the Cluster Resource Manager (crm) in Pacemaker, whose job it is to actually do something about it.
Making use of the Corosync cluster engine, it’s Pacemaker’s job to actually take care of the managed resources in the cluster.
While we tell Corosync about the nodes in the cluster, we tell Pacemaker about what resources to run in the cluster, and how it should be done.
A resource is just anything that Pacemaker can manage. While it can be almost anything you like, typical examples are DRBD devices, filesystem mounts, IP addresses, etc.
Just starting resources isn’t enough though – we need to make sure that resources are started in the correct order, and on the right node. This is where constraints come in (eg. start A before B, and B before C). For example, we can’t mount a filesystem until the underlying DRBD block device is up and running. Similarly, we can’t start a daemon that listens on the network until its IP address is brought up on the same machine.
Now that we have the management components out of the way, we can talk about the building blocks of actually running an HA database on it.
Without constraints, HA resources are effectively independent. To be useful to us, we build them into a stack. Resources higher in the stack necessarily depend on resources further down the stack, as described in the previous section.
The stack isn’t part of Pacemaker’s config, it’s purely conceptual. In action, we’ll push the whole stack of resources between cluster nodes.
In rough order that they appear in the stack, we’ll look at the DRBD storage, filesystem, IP addresses, and the database daemon.
DRBD stands for Distributed Replicated Block Device, and can be thought of as RAID-1 over a network. DRBD provides us with a block device that is guaranteed to be identical at both ends, giving us a form of shared storage between two cluster nodes.
Because DRBD presents a generic block device to the system, it can be formatted with a filesystem and used exactly as you would any other storage medium.
A DRBD device is a Pacemaker-managed resource, with the constraint that it can only be used on one cluster node at a time (the one that will run the database daemon).
DRBD must be up and running before we can mount the filesystem that lives on it, which is the next step.
Before using a DRBD device for the first time we create a filesystem, usually a vanilla ext3 or ext4. Once prepared, we can then have Pacemaker manage the mounting/unmounting at /var/lib/pgsql.
The filesystem can only be mounted on one cluster node at a time, which Pacemaker will guarantee. A cluster-filesystem can be multi-mounted, but would provide no benefit in this scenario.
The filesystem must be mounted after DRBD is started, and before we attempt to start the Postgres daemon.
To provide a consistent entry point to the database, we create a special “floating” HA IP address that will always be present on the active cluster node.
Like the other resources, the IP address can only be used on one cluster node at a time. Pacemaker will handle this for us.
The IP address can be brought online at any time (eg. while the DRBD device and filesystem are being prepared), but it must be up before Postgres is started.
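Putting it all together, the stack and its constraints come out something like this in crm shell syntax (resource names and the address are examples):

    primitive p_drbd_pgsql ocf:linbit:drbd params drbd_resource=pgsql op monitor interval=20s
    ms ms_drbd_pgsql p_drbd_pgsql meta master-max=1 clone-max=2 notify=true
    primitive p_fs_pgsql ocf:heartbeat:Filesystem \
        params device=/dev/drbd0 directory=/var/lib/pgsql fstype=ext4
    primitive p_ip_pgsql ocf:heartbeat:IPaddr2 params ip=192.168.0.50 cidr_netmask=24
    primitive p_pgsql ocf:heartbeat:pgsql op monitor interval=20s
    group g_pgsql p_fs_pgsql p_ip_pgsql p_pgsql
    colocation c_pgsql_on_drbd inf: g_pgsql ms_drbd_pgsql:Master
    order o_drbd_before_pgsql inf: ms_drbd_pgsql:promote g_pgsql:start

The group starts its members in order – filesystem, then IP, then Postgres – and the colocation and order constraints pin the whole lot to whichever node holds the DRBD Primary role.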
Things will sometimes break or malfunction in an HA cluster; this is expected. Some types of failure are tolerable (eg. by retrying), while others are more critical. A communications failure is the latter.
STONITH exists to solve a problem called “split-brain”. If the two nodes can’t talk to each other, they can’t be sure who’s at fault. Because it’s their job to make sure all the resources are running, they’ll both want to take the “active” role in the cluster. This is the split-brain.
A split-brain situation is dangerous because both nodes will attempt to use resources that can’t be shared. If a communications failure is preventing us from asking the other node to gracefully let go of (“stop”) a resource, we use the nuclear option and switch off power to the other node.
As an example, assume we have two nodes Alpha and Beta that manage an ext3 filesystem.
The filesystem is currently mounted on Alpha
A clumsy datacentre technician is moving some cabling and inadvertently unplugs the switch carrying the cluster traffic
Alpha thinks that Beta has crashed. This is no big deal, the filesystem is still mounted
Beta thinks that Alpha has crashed, oh no! We need to unmount the filesystem on Alpha and mount it locally on Beta
The network is down, so Beta can’t ask Alpha to unmount the filesystem
Beta invokes a STONITH action on Alpha, it’s the only way to be sure! Alpha’s DRAC receives a poweroff command and promptly shuts down, hard
Beta now mounts the filesystem. It needs a fsck because it wasn’t cleanly unmounted, but we’re up and running again shortly
STONITH is obviously a very violent operation, so we want to make sure it only kicks in when things have really gone bad and we’re out of options to get the resources started. We guard against spurious triggers by having redundant links for our cluster traffic.
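For completeness, a fencing resource in the same crm shell style, here using the external/drac5 agent to match the DRAC example above – agent and parameter names vary with your hardware and software versions, and the values here are examples:

    primitive st_alpha stonith:external/drac5 \
        params hostname=alpha ipaddr=192.168.10.1 userid=root passwd=secret
    # a node must never be the one responsible for fencing itself
    location l_st_alpha st_alpha -inf: alpha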
That wraps up our introduction to HA. In the near future we’ll talk more about how HA clusters are used as part of a larger system, and the kinds of considerations you need to make when adding one to your architecture. If anything is unclear or you just have a burning question, feel free to leave a comment.
Update from 2012-05-24: The Corosync devs have addressed this and a patch is in the pipeline. The effect is roughly as described below, to build the linked list by appending to the tail, and preferring an exact IP address match for bindnetaddr (which was intended all along but got lost along the way). Rejoicing all round!
We’ve been looking at some of Corosync’s internals recently, spurred on by one of our new HA (highly-available) clusters spitting the dummy during testing. What we found isn’t a “bug” per se (we’re good at finding those), but a case where the correct behaviour isn’t entirely clear.
We thought the findings were worth sharing, and we hope you find them interesting even if you don’t run any clusters yourself.
Disclaimer: We’d like to emphasise that this is purely just research so far; we have yet to formalise this and perform further testing so we can ask the right questions of the Corosync devs.
Before signing-off on cluster deployments we run everything through its paces to ensure that it’s behaving as expected. This means plenty of failovers and other stress-testing to verify that the cluster handles adverse situations properly.
Our standard clusters comprise two nodes with Corosync+Pacemaker, running a “stack” of managed resources. HA MySQL is a common example: DRBD, a mounted filesystem, the MySQL daemon and a floating IP address for MySQL.
During routine testing for a new customer we saw the cluster suddenly partition itself and go up in flames. One side was suddenly convinced there were three nodes in the cluster and called in vain for a STONITH response, while the other was convinced that its buddy had been nuked from orbit and attempted to snap up the resources. What was going on!?
It was time to start poring over the logs for evidence. To understand what happened you need to know how Corosync communicates between nodes in the cluster.
A crash-course in Corosync
Linux HA is split into a number of parts that have changed significantly over time. At its simplest, you can consider two major components:
A cluster engine that handles synchronisation and messaging; this is Corosync
A cluster resource manager (CRM) that uses the engine to manage services and ensure they’re running where they should be; this is Pacemaker
We’re only interested in Corosync, specifically its communication layer.
There are two major types of communication in Corosync: the shared cluster data, and a “token” which is passed around the ring of cluster nodes (this is a conceptual ring, not a physical network ring). The token is used to manage connectivity and provide the synchronisation guarantees necessary for the cluster to operate. Cluster data is enqueued on each node as it’s received; when a node receives the token, it processes the multicast data that has queued up, does whatever it needs to, then passes the token to the next node.
The token is always transferred by unicast UDP. Cluster data can be sent either by multicast UDP or unicast UDP – we use multicast. In either case, the source address is always a normal unicast address.
Given this, a 4-node cluster looks something like this:
One convenient feature in Corosync is its automatic selection of source address. This is done by comparing the bindnetaddr config directive against all IP addresses on the system and finding a suitable match. The cool thing about this is that you can use the exact same config file for all nodes in your cluster, and everything should Just Work™.
Automatic source-address selection is always used for IPv4, it’s not negotiable. It’s never done for IPv6, addresses are used exactly as supplied to bindnetaddr.
Interestingly, you only supply an address to bindnetaddr, such as 192.168.0.42 – CIDR notation is not used, as you might expect when referring to a subnet. Instead, Corosync compares each of the system’s addresses (plus the associated netmask) against bindnetaddr, applying the same netmask. This diagram demonstrates a typical setup:
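For reference, the relevant corosync.conf fragment looks like this (addresses are examples):

    totem {
        version: 2
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.0.42    # a plain host address – no CIDR suffix
            mcastaddr: 239.255.1.1
            mcastport: 5405
        }
    }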
The problem as we see it
The key here is in two parts. Firstly, it’s possible for a floating pacemaker-managed IP to match against your bindnetaddr specification. As an example:
192.168.0.1/24 – static IP used for cluster traffic
192.168.0.42/24 – a floating “service IP” used for an HA daemon
Secondly, Corosync sometimes re-enumerates network addresses on the host for source address selection. We’re not 100% sure of the circumstances under which this occurs, but a classic example would be while performing a rolling upgrade of the cluster software. A normal process would be to unmanage all your resources, stop Pacemaker and Corosync, upgrade, then bring them back up again and remanage your resources.
Taken together, this can lead to the cluster getting very confused when one host’s unicast address for cluster traffic suddenly changes. Consider a 2-node cluster comprising nodes A and B:
A’s address changes
The token drops as a result of the ring breaking
B thinks A is offline, as A’s new address isn’t allowed through our outbound firewall rules, so B doesn’t receive anything from A
A also thinks B is offline, because B can’t hear A
In the meantime, A will “discover” a new cluster node: itself, operating on the new address!
This state is passed on to Pacemaker, which attempts to juggle the resources to satisfy its policy. For resources that were already running on A, it now sees duplicate copies, which isn’t allowed. Original-A asks for new-A to be STONITH’d. Meanwhile, B is just trying to get all the resources started again.
This is decidedly not ideal. What really got us is that we hadn’t seen this behaviour in any previously deployed cluster; something was different.
Normally we’ll dedicate a separate subnet to cluster traffic, keeping it away from “production traffic” like load-balanced MySQL or HTTP. This time we didn’t, opting to reuse the subnet. We did this because we don’t like “polluting” a network segment with multiple IP subnets. We could have set up another VLAN to confine the subnet (avoiding pollution), but it would’ve meant putting that into our switches just for two physical machines, which seemed like overkill.
Reusing the existing subnet for cluster traffic clearly had something to do with our problem, so it was time to go digging in Corosync to see what it does.
How Corosync can select a different address
In the Corosync source there’s a function called totemip_getifaddrs(), which gets all the addresses from the kernel and puts them in a linked list. For simplicity, you can think of them as tuples of (name,IP). The name will be based on the device’s familiar name, but includes labels if they’re present; eg. eth0, eth0:00, eth0:nfs are all fine.
The list is built by prepending each item to the head of the list. As a result, “later” addresses appear at the head of the list. This means that when Corosync goes to traverse the list, it hits them in the reverse order of what a human would tend to expect (the kernel’s listing is whatever order getifaddrs() returns, which is likely arbitrarily ordered, but probably “sane” as far as we’re concerned).
When Corosync searches the linked list for a match, it stops on the first one it finds. Of course our list is backwards, so a newly added address, such as a floating HA IP, is likely to be selected if it’s viable.
This diagram shows an entirely plausible linked list that could be produced by totemip_getifaddrs():
This is good – we know what’s going on now. We need a solution though, and of course this cluster is due to be handed over yesterday.
A workaround is simple enough, and it’ll let us keep our overlapping subnets. It should be noted that this is very much a workaround – it’s not even clear that the behaviour we’ve seen is a problem, the answer could be as simple as “you shouldn’t do that”.
We’ve chosen to tackle the problem in two ways, to provide a little defence in depth. Either one should do the job on its own, but it doesn’t hurt to be sure.
Skip the IP if the name has a colon in it. It’s specific to the way we handle IPs, but will probably work for most people.
Append to the tail of the list to maintain the expected ordering.
The patch is really simple. For brevity, this is the Linux code path; it applies just the same to the Solaris code further down the file.
The getifaddrs() interface used is pretty limited. The kernel has additional flags to denote Primary and Secondary addresses, which might be useful when selecting a good source address, but they’re not available through getifaddrs(). As such, Corosync has no way to use additional criteria to filter the results of the query when selecting a source address.
We make a habit of labelling all our addresses, even the floating HA ones. That means it’s easy to ignore them by checking for the colon-delimiter in the interface name. Ideally Corosync would have another (smarter) way of getting address information from the kernel, but portability concerns may make this difficult.
Other ways to dodge this
Clearly we’ve been able to avoid this problem in the past, but our usual approach isn’t the only way to dodge it.
Use a separate subnet and NIC for cluster traffic so this doesn’t happen
Alter the behaviour of bindnetaddr such that it will prefer an exact match if it’s available, otherwise fall back to the smart selection as usual
For now we’ve opted to make a patch to implement the latter behaviour. That’ll cover us while we await upstream feedback.