A client assignment to secure extreme availability (> 5 Nine ’s) of large scale
enterprise system for both physical and cloud deployments. The approach followed was to
break the system through random failure introduction, monitor for compromises in system
availability and design in resilience functionality.
Ensure that system is fully resilient to internal
and external failures and critical functionality is
always available in a production environment.
A secondary goal was to develop and maintain
an automated regression test to ensure system
availability not degraded during system development.
The overall solution was modelled on 'Chaos Monkey' practices pioneered by
Netflix and other companies deploying on Amazon Web Services.
Creation of random system failure and stress testing through invasive
software based agents.
Measurement and monitoring of system behaviour and reaction to failure and
stress testing. Measurement and monitoring of system availability during
stress testing. Measuring availability during system upgrade and preventing
Designing in resilience and recoverability in response to any observed
lack of availability during failure/ stress testing. Constant evolution of
new failure scenarios based on issue slip through
Building tests into Test Automation and Continuous Integration frameworks
to ensure no degrade in system availability as new functionality is added
during development phase.
The availability tests were originally designed for a Physical Deployment
and subsequently modified to work in an OpenStack Private Cloud Deployment.
Client has enjoyed reduced cost for maintenance and support of their
systems, increase in customer satisfaction and
enhanced reputation due to meeting customer
expectations on system availability and resilience.