Parwy's Blog: Lessons in Software Reliability

15 August 2011

Lessons in Software Reliability | Agile Zone

Hire good developers and give them enough time to do a good job, including time to review and refactor.

Make sure the development team has training on the basics, that they understand the language and frameworks.

Regular code reviews (or pair programming, if you’re into it) for correctness and safety.

Use static analysis tools to find common coding mistakes and bug patterns.

Design for failure
Failures will happen: make sure that your design anticipates and handles failures. Identify failures, contain, retry, recover, restart. Contain failures, ensure that failures don’t cascade. Fail safe. Look for the simplest HA design alternative: do you need enterprise-wide clustering or virtual synchrony-based messaging, or can you rely on simpler active/standby shadowing with fast failover?

Keep it Simple
Attack complexity: where possible, apply Occam’s Razor, and choose the simplest path in design or construction or implementation. Simplify your technology stack, collapse the stack, minimize the number of layers and servers.

Test… test… test….
Testing for reliability goes beyond unit testing, functional and regression testing, integration, usability and UAT. You need to test everything you can every way you can think of or can afford to.

One of the best investments that we made was building a reference test environment, as big as, and as close to the production deployment configuration, as we could afford. This allowed us to do representative system testing with production or production-like workloads, as well as variable load and stress testing, operations simulations and trials.

Stress testing is especially important: identifying the real performance limits of the system, pushing the system to, and beyond, design limits, looking for bottlenecks and saturation points, concurrency problems – race conditions and deadlocks – and observing failure of the system under load. Watching the system melt down under extreme load can give you insight into architecture, design and implementation weaknesses.

Failure handing and failover testing – creating controlled failure conditions and checking that failure detection and failure handling mechanisms work correctly.

Get the development team, especially your senior technical leaders, working closely with operations staff: understanding operations' challenges, the risks that they face, the steps that they have to go through to get their jobs done. What information do they need to troubleshoot, to investigate problems? Are the error messages clear, are you logging enough useful information? How easy is it to startup, shutdown, recover and restart – the more steps, the more problems. Make it hard for operations to make mistakes: add checks and balances. Run through deployment, configuration and upgrades together: what seems straightforward in development may have problems in the real world.

Build in health checks – simple ways to determine that the system is in a healthy, consistent state, to be used before startup, after recovery / restart, after an upgrade. Make sure operations has visibility into system state, instrumentation, logs, alerts – make sure ops know what is going on and why.

When you encounter a failure in production, work together with the operations team to complete a Root Cause Analysis, a structured investigation where the team searches for direct and contributing factors to the failure, defines corrective and preventative actions. Dig deep, look past immediate causes, keep asking why. Ask: how did this get past your checks and reviews and testing? What needs to be changed in the product? In the way that it is developed? In the way that is implemented? Operated?