Uptime Guarantee
Our business is about delivering reliable services. We guarantee uptime across the services we manage, and we meet that guarantee through a three-pronged approach:
- Prevention
- Detection
- Response
Our 99.8% uptime guarantee is met by delivering on all three components.
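To put the figure in perspective, the short calculation below converts an uptime percentage into the downtime it permits. It is an illustrative sketch only: the function name and the choice of a 30-day month and a 365-day year are assumptions made for the example, not the measurement period defined by the guarantee.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float) -> float:
    """Return the minutes of downtime permitted in a period at a given uptime level."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - uptime_percent / 100)


if __name__ == "__main__":
    # Assumed periods for illustration: a 30-day month and a 365-day year.
    for label, days in (("30-day month", 30), ("365-day year", 365)):
        minutes = allowed_downtime_minutes(99.8, days)
        print(f"{label}: {minutes:.1f} minutes ({minutes / 60:.2f} hours)")
```

Under these assumptions, 99.8% uptime corresponds to roughly 86 minutes of downtime in a 30-day month, or about 17.5 hours in a year.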
Prevention
We prevent errors from causing outages in two ways. Firstly, redundant systems stop failures from affecting live services. Secondly, preventative maintenance catches possible triggers before they cause outages.
Redundancy is implemented on all core network infrastructure down to the point immediately prior to the client handoff.
At the server level, redundancy can be provisioned for all common points of failure. This includes hot-swappable hard drives and power supplies.
Hardware testing: we do not assume new server hardware to be in a working state. All new equipment undergoes a pre-deployment burn-in process that is less tolerant of errors than manufacturer testing.
Operating system & application hardening: all servers are hardened at the time of deployment through firewalling, disabling unneeded services, TCP wrappers, least privilege, application vendor updates and defence-in-depth principles.
Monitoring and analysis: server activity logs are reviewed daily for signs of unexpected behaviour.
Detection
Inevitably, problems do occur on all systems. When they do, rapid and comprehensive detection systems ensure they can be dealt with.
Custom-built monitoring services are implemented across all services. The monitoring systems have two facets: availability and performance.
Availability monitoring covers up to 20 different data points on any given server, with a further 5 watched by performance monitoring systems.
Availability monitoring systems poll services at a 2-minute interval. When an event triggers a response, alerts are immediately issued via instant messaging and SMS. Alerts are issued to 4 people on a 24 × 7 basis.
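The poll-and-alert cycle can be sketched roughly as follows. This is a minimal illustration, not our monitoring application: the function names are hypothetical, a plain TCP connection test stands in for the real availability checks, and the alert callback (here defaulting to print) stands in for the instant messaging and SMS channels.

```python
import socket
import time
from typing import Callable, Iterable, List, Tuple

POLL_INTERVAL_SECONDS = 120  # the 2-minute polling interval

Target = Tuple[str, int]  # (host, port); an assumed way to describe a service


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Assumed availability check: can a TCP connection be opened to host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def poll_once(
    targets: Iterable[Target],
    check: Callable[[str, int], bool],
    alert: Callable[[str], None],
) -> List[Target]:
    """Check every target once, alert on each failure, and return the failed targets."""
    failed = []
    for host, port in targets:
        if not check(host, port):
            failed.append((host, port))
            alert(f"ALERT: {host}:{port} is not responding")
    return failed


def run_forever(
    targets: List[Target],
    check: Callable[[str, int], bool] = tcp_port_open,
    alert: Callable[[str], None] = print,  # placeholder for IM and SMS delivery
    interval: float = POLL_INTERVAL_SECONDS,
) -> None:
    """Poll all targets, then sleep for the interval, indefinitely."""
    while True:
        poll_once(targets, check, alert)
        time.sleep(interval)
```

Passing the check and alert functions in as parameters keeps the loop independent of any particular protocol or messaging provider, which also makes a single polling pass easy to exercise in isolation.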
External systems ensure that the monitoring services themselves remain continuously operational.
All outage periods are logged and reported within the monitoring application for future diagnostic reference.
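One way to derive outage periods from a series of poll results is sketched below. The data layout (a chronological list of timestamp and up/down pairs) and the function name are assumptions for illustration; the monitoring application's actual log format is not described here.

```python
from datetime import datetime
from typing import List, Optional, Tuple

Sample = Tuple[datetime, bool]               # (poll time, service was up)
Outage = Tuple[datetime, Optional[datetime]]  # (start, end); end is None if ongoing


def outage_periods(samples: List[Sample]) -> List[Outage]:
    """Collapse chronological poll results into outage periods.

    An outage starts at the first failed poll and ends at the next
    successful poll. An outage still open at the end has end=None.
    """
    periods: List[Outage] = []
    start: Optional[datetime] = None
    for timestamp, is_up in samples:
        if not is_up and start is None:
            start = timestamp                    # outage begins
        elif is_up and start is not None:
            periods.append((start, timestamp))   # outage ends
            start = None
    if start is not None:
        periods.append((start, None))            # still down at last poll
    return periods
```

Because polls are periodic, the recorded start and end times are only accurate to within one polling interval.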
Response
Once a problem is detected, it is escalated via well-defined support procedures to ensure it reaches the correct person to resolve it.
A number of technologies and philosophies contribute to rapid resolution of faults:
- Remote management capability: all infrastructure can be remotely managed by Anchor staff via secure (VPN) channels. This capability extends to both our primary management site and remote workers.
- Remote reboot systems are in place on all devices, providing the ability to rapidly reset the power without the operator leaving their seat.
- Lights-out management can be deployed to allow detailed diagnostics and control of equipment that does not respond to power cycling.
- High uptime targets cannot be met by relying on hardware vendor warranties alone. To this end, all equipment is deployed across a standardised platform backed by a common inventory of onsite spare components. Complete system failures can always be rectified using this inventory.
- Automated problem rectification: as a failsafe backup to human intervention, monitoring systems can be configured to automatically restart services that have failed.
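The automated restart behaviour in the last point could look something like the sketch below. It is an assumption-laden illustration rather than a description of our tooling: the function names, the retry limit of three, and the use of a shell command to perform the restart are all hypothetical.

```python
import subprocess
from typing import Callable, Sequence


def restart_if_failed(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    max_attempts: int = 3,  # assumed limit before leaving the fault to a human
) -> bool:
    """Restart a failed service up to max_attempts times.

    Returns True if the service is healthy when the function finishes,
    False if it is still failing after the final restart attempt.
    """
    for _ in range(max_attempts):
        if is_healthy():
            return True
        restart()
    return is_healthy()


def command_restarter(command: Sequence[str]) -> Callable[[], None]:
    """Build a restart action that runs an external command (assumed mechanism)."""
    def restart() -> None:
        # check=False: a failed restart command is detected by the next health check.
        subprocess.run(command, check=False)
    return restart
```

On a systemd host, for example, the restart action might be built with command_restarter(["systemctl", "restart", "nginx"]), where the service name is purely an example. Capping the number of attempts keeps a persistently failing service from being restarted in a loop, so the fault still reaches a person through the normal alerting path.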