What does it take to make your application reliable?

Reliability Defined

  • Services available to users at least 99.8% of the time. In other words, they suffer degradation for no more than about 1.5 hours per month.
  • Capacity to scale operations at short notice and without interruption.
  • Maintaining continuity and integrity through security.
  • Elimination of single points of failure, including staff.
  • Routine proactive and preventative maintenance.
  • 24 × 7 monitoring to detect problems.
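
The 99.8% figure in the list above can be turned into a downtime allowance with simple arithmetic. The following is a minimal sketch, assuming a 30-day month by default; the function name and parameters are illustrative only.

```python
HOURS_PER_DAY = 24


def downtime_budget_hours(availability_pct: float, days: int = 30) -> float:
    """Return the hours of downtime allowed over a period of `days` days."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_hours = days * HOURS_PER_DAY
    # The unavailable fraction of the period is the downtime budget.
    return total_hours * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.0, 99.8, 99.9):
        hours = downtime_budget_hours(target)
        print(f"{target}% -> {hours:.2f} hours per 30-day month")
```

For a 99.8% target this works out to 1.44 hours in a 30-day month and just under 1.5 hours in a 31-day month.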

To achieve reliability we must address the environment, connectivity, people and planning behind your application.

Redundancy

Achieving very high levels of availability relies on all systems working correctly all of the time. Any single point of failure needs to be removed or accounted for, including power supplies, network connections, monitoring systems, people, climate control and server hardware.
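
To see why single points of failure matter, it can help to compare the availability of a chain of parts with that of a duplicated part. The sketch below uses the standard series and parallel formulas and assumes that failures are independent, which real systems only approximate. The availability figures are hypothetical, chosen purely for illustration.

```python
from math import prod


def series_availability(parts):
    """Availability of a chain in which every part must be working."""
    return prod(parts)


def redundant_availability(single, copies):
    """Availability of `copies` interchangeable units when one is enough.

    Assumes the units fail independently of one another.
    """
    return 1 - (1 - single) ** copies


if __name__ == "__main__":
    # Hypothetical figures: each part is available 99% of the time.
    chain = series_availability([0.99, 0.99, 0.99])
    pair = redundant_availability(0.99, copies=2)
    print(f"Three parts in a chain: {chain:.4%}")
    print(f"One part duplicated:    {pair:.4%}")
```

Under these assumptions, every extra part in a chain lowers overall availability, while duplicating a part raises it.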

Environment

Physical security of the area housing the equipment must be maintained. Access should be restricted to authorised personnel. An alarm system should be fitted to detect security breaches.

With such high uptime targets, the reliability of mains power is not sufficient, so redundant power supplies are needed. Even a short power outage can turn into an extended service outage if devices fail to return automatically to their correct state. Uninterruptible Power Supplies for short-term outages and onsite diesel generators for extended outages are mandatory.

Servers continuously generate large amounts of heat. Redundant high capacity cooling systems are required to maintain a stable operating temperature.

As the dependence on online applications grows, so does the infrastructure required to support them. Allowance must be made not only for existing equipment and current capacity but also for their expansion, without suffering extensive outages during upgrades.

Connectivity

The network connections serving the equipment that runs your application are perhaps among the most critical parts of maintaining reliability.

Outages due to cable cuts, telco faults and configuration errors can often take hours or days to resolve. Multiple connections are a must.

When operating multiple connections, ensuring traffic is correctly routed across them (depending on their state) is not a simple task and relies on expensive and complex devices.
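
The core decision behind state-dependent routing can be sketched in a few lines, although in practice it is made by dedicated network equipment rather than application code. This is a simplified illustration only. The `Link` type, its fields and the link names are hypothetical, and the health flag is assumed to come from some external check.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Link:
    name: str       # hypothetical label for the connection
    priority: int   # lower number means more preferred
    healthy: bool   # result of the most recent health check (assumed)


def choose_link(links: Sequence[Link]) -> Optional[Link]:
    """Return the most preferred healthy link, or None if all are down."""
    candidates = [link for link in links if link.healthy]
    return min(candidates, key=lambda link: link.priority, default=None)


if __name__ == "__main__":
    links = [
        Link("primary", priority=1, healthy=False),
        Link("backup", priority=2, healthy=True),
    ]
    selected = choose_link(links)
    print(selected.name if selected else "no connection available")
```

The hard parts in a real deployment, such as detecting a failed link quickly and moving traffic without dropping sessions, are exactly what this sketch leaves out.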

The Internet is dynamic and unpredictable, and the capacity you require today might not be enough for tomorrow. You need to always keep plenty of spare capacity to accommodate peak loads.
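
Spare capacity can be tracked as a simple ratio of peak load to total capacity. The following is a rough sketch. The 50% minimum headroom is an assumed threshold, not a recommendation from this article, and the traffic figures are made up.

```python
def headroom_ratio(capacity_mbps: float, peak_mbps: float) -> float:
    """Return the fraction of capacity left unused at peak load."""
    if capacity_mbps <= 0:
        raise ValueError("capacity_mbps must be positive")
    return 1 - peak_mbps / capacity_mbps


def needs_upgrade(capacity_mbps: float, peak_mbps: float,
                  min_headroom: float = 0.5) -> bool:
    """Return True when headroom at peak falls below `min_headroom` (assumed)."""
    return headroom_ratio(capacity_mbps, peak_mbps) < min_headroom


if __name__ == "__main__":
    # Hypothetical link: 1000 Mbps capacity with a 700 Mbps observed peak.
    print(f"Headroom at peak: {headroom_ratio(1000, 700):.0%}")
    print("Upgrade needed:", needs_upgrade(1000, 700))
```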

People

Making sure your online application is reliable is just as much about having people you can rely on as it is about environments and devices.

Reaching 99.8% uptime means responding to problems on a 24 × 7 basis and ensuring there’s a team of people, not just an individual, available to do this.

The obstacles to achieving 100% uptime vary. Some are easily overcome, whilst others require the expertise of highly qualified and experienced technical staff who specialise in managing online applications.

Keeping a car on the road without breakdowns requires routine maintenance, and computers are the same. All the systems that go into supporting your application need continual preventative management to detect problems before they cause downtime.
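
One small example of detecting a problem before it causes downtime is warning when a disk is filling up, well before it is actually full. The sketch below uses Python's standard `shutil.disk_usage`. The 80% and 95% thresholds are assumptions chosen for illustration, and a real monitoring system would check far more than disk space.

```python
import shutil

WARN_AT = 0.80      # assumed warning threshold (fraction of disk used)
CRITICAL_AT = 0.95  # assumed critical threshold


def classify(used_fraction: float,
             warn_at: float = WARN_AT,
             critical_at: float = CRITICAL_AT) -> str:
    """Map a used-space fraction to 'ok', 'warning' or 'critical'."""
    if used_fraction >= critical_at:
        return "critical"
    if used_fraction >= warn_at:
        return "warning"
    return "ok"


def check_disk(path: str = ".") -> str:
    """Classify the disk holding `path` by how full it is."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        raise ValueError(f"cannot determine disk size for {path!r}")
    return classify(usage.used / usage.total)


if __name__ == "__main__":
    print("Disk status:", check_disk())
```

Run on a schedule, a check like this gives staff time to act during working hours rather than during an outage.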

Planning

As you can see by now, keeping systems online with uptime as close to 100% as possible is a complex task with a large number of dependencies. Getting all of the different parts to work together in the dynamic environment that is the Internet requires continuous attention to planning and future proofing.