Application Clustering for High Availability
The HA binge continues: today we’re talking about high availability through clustering – providing a service with multiple, independent servers. This differs from the options we’ve discussed so far because it doesn’t involve Corosync and Pacemaker.
We’ll still be using the term “clustering”, but it’s now applied high up at the application level. There are no shared resources within the cluster, and the software on each node is independent of the others.
A brief description
For this article we’re talking exclusively about naive applications that aren’t designed for clustering – they’re unaware of other nodes, and rely on an independent load balancer to distribute incoming requests. There are applications with clustering-awareness built in, but they’re targeted at specific tasks and aren’t generally applicable, so they’re not worth discussing here.
Comparison with highly available resources
Using Corosync and Pacemaker requires tight communication between nodes for proper functioning. In comparison, a clustered application need not have any communication between nodes.
This works on the assumption that incoming requests are stateless and independent. Multiple requests from the same client are likely to be processed by different nodes, and client-specific data should not be stored on cluster nodes as it won’t be shared around.
As such, application clustering is best suited to services that are short-running and/or stateless in nature. Good examples include web, FTP and VPN servers.
While conceptually simple, it’s a reasonably complex solution that needs careful attention to ensure that the assumptions of statelessness and independence hold true. In return, application clustering deals with failure very gracefully and needs minimal maintenance compared to non-clustered systems.
How it works
The above diagrams explain application clustering well enough, though there are a few caveats that can’t be explained graphically.
- The load-balancing mechanism is deliberately unspecified. It could be as simple as using round-robin DNS, though a “smarter” solution is usually used.
- The load balancer needs to be smart enough to detect failed servers and remove them from the pool. Otherwise, a proportion of client requests will keep failing.
- It’s for this reason that DNS round robin is not recommended, as there’s no simple way to remove a failed host.
- The load balancer is now a single point of failure, and thus also needs some form of HA protection. Anchor’s deployments make use of ldirectord/IPVS for load-balancing, managed by Corosync and Pacemaker to ensure high availability.
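To make the pool concept concrete, here’s a minimal sketch of a round-robin server pool that skips servers marked as failed. This is illustrative only – the `ServerPool` class and the server names are hypothetical, and real load balancers like ldirectord/IPVS do this in the kernel rather than in application code:

```python
from itertools import cycle

class ServerPool:
    """Hypothetical round-robin pool; servers marked failed are skipped."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._rr = cycle(self.servers)

    def mark_failed(self, server):
        self.healthy.discard(server)

    def mark_recovered(self, server):
        self.healthy.add(server)

    def pick(self):
        # Walk the round-robin cycle, skipping unhealthy servers.
        for _ in range(len(self.servers)):
            server = next(self._rr)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers in pool")

pool = ServerPool(["web1", "web2", "web3"])
print([pool.pick() for _ in range(3)])  # cycles through all three servers
pool.mark_failed("web2")
print([pool.pick() for _ in range(4)])  # web2 is silently skipped
```

The key point is that removal from the pool is invisible to clients: requests simply stop arriving at the failed node, with no client-side reconfiguration. This is exactly what plain round-robin DNS can’t do cleanly.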
How failure is handled
Because the servers are all live, all the time, graceful handling of failures relies on the load balancer detecting a failed node, and removing it from the pool of servers. The load balancer detects failure by polling each server periodically, and checking for a healthy response.
In practice, some requests will hit a failing server before it’s removed from the pool. This is generally acceptable, and can be tuned by adjusting the load balancer’s failure thresholds.
As an example, we might poll every 5 seconds, require a healthy reply within 1 second, and drop the server from the pool after 3 consecutive failed checks.
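Those thresholds can be sketched as a small health monitor. The class and parameter names below are hypothetical, and the actual check is injected as a function so it could be an HTTP GET, a TCP connect, or anything else – real load balancers expose equivalent tunables in their configuration:

```python
POLL_INTERVAL = 5.0   # seconds between health checks (figures from the example above)
TIMEOUT = 1.0         # a healthy reply must arrive within this window
FAIL_THRESHOLD = 3    # consecutive failures before removal from the pool

class HealthMonitor:
    """Hypothetical sketch: track consecutive failures per server and
    drop a server from the pool once it crosses the failure threshold."""

    def __init__(self, servers, check, threshold=FAIL_THRESHOLD):
        self.failures = {s: 0 for s in servers}
        self.check = check            # check(server, timeout) -> bool
        self.threshold = threshold
        self.in_pool = set(servers)

    def poll_once(self):
        for server in self.failures:
            if self.check(server, TIMEOUT):
                self.failures[server] = 0
                self.in_pool.add(server)      # re-admit recovered servers
            else:
                self.failures[server] += 1
                if self.failures[server] >= self.threshold:
                    self.in_pool.discard(server)

# Simulated checks: web1 is healthy, web2 is down.
responses = {"web1": True, "web2": False}
mon = HealthMonitor(["web1", "web2"], lambda s, t: responses[s])
for _ in range(3):
    mon.poll_once()
print(sorted(mon.in_pool))  # web2 is gone after 3 consecutive failures
```

Tightening the thresholds removes a failed server sooner (fewer requests hit it), at the cost of more false positives from transient slowness; loosening them does the reverse.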
Suitability and summary
As part of a highly available architecture, application level clustering is right at home with webservers and other frontend services, typically backing onto HA databases and fileservers. As a bonus, application clustering also increases capacity in a roughly linear fashion as more servers are added, giving optimal utilisation of resources.
Most web applications can be clustered with some modifications, but there are a few pitfalls to watch out for. They mostly revolve around session handling and deployment, which we’ll cover in the next instalment.