Software Reliability

One of my colleagues asked at a recent open spaces event whether the software industry was doing enough about software reliability. The question was inspired by Bob Martin's talk: https://www.youtube.com/watch?v=17vTLSkXTOo The talk raised the fear of 10K deaths caused by software and of potential future restrictions.

To start with, I mentioned the aircraft practice of having three independent teams write distinct solutions, where at least two of the three must agree on the output for any given input.
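As a rough illustration of the two-out-of-three idea, here is a minimal Python sketch. The three integer square root functions and the `vote` helper are hypothetical names made up for this example (one of them is deliberately wrong for a single input), and real avionics systems are far more involved than this.

```python
from collections import Counter
from typing import Callable, Sequence


class NoMajorityError(Exception):
    """Raised when too few implementations agree on an answer."""


def vote(implementations: Sequence[Callable[[int], int]], value: int, quorum: int = 2) -> int:
    """Run every implementation and return the answer that at least `quorum` of them share."""
    results = []
    for impl in implementations:
        try:
            results.append(impl(value))
        except Exception:
            # An implementation that crashes simply loses its vote.
            continue
    if results:
        answer, count = Counter(results).most_common(1)[0]
        if count >= quorum:
            return answer
    raise NoMajorityError(f"no {quorum} implementations agreed for input {value!r}")


# Three hypothetical, independently written integer square roots.
def isqrt_newton(n: int) -> int:
    if n < 2:
        return n
    x = n
    while x * x > n:
        x = (x + n // x) // 2
    return x


def isqrt_linear(n: int) -> int:
    r = 0
    while (r + 1) * (r + 1) <= n:
        r += 1
    return r


def isqrt_faulty(n: int) -> int:
    # Deliberately wrong for n == 16, to stand in for a bug in one team's code.
    return isqrt_linear(n) + (1 if n == 16 else 0)


TEAMS = [isqrt_newton, isqrt_linear, isqrt_faulty]
print(vote(TEAMS, 16))  # the faulty answer is outvoted by the other two
```

The point of the design is that a bug has to appear in two independent implementations, for the same input, before a wrong answer gets through.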

The next item is the Erlang/Elixir/OTP ecosystem with its supervisor trees, which embody the Erlang principle of “let it crash”. Erlang was designed on the expectation that software will fail and that the machine it runs on will fail too. This is why the Erlang VM is designed to be distributed: running across more than one machine is the only way to protect against the failure of any one of them. It even allows software to be upgraded while running. This is the platform that runs telephone switches, Heroku, RabbitMQ and WhatsApp.
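To give a flavour of “let it crash” for readers who have not used OTP, here is a small Python sketch of a supervisor that restarts a failing task up to a limit. It is only an analogy: `supervise`, `RestartLimitExceeded` and `make_flaky_worker` are names invented for this example, and Python offers nothing like the process isolation or supervisor trees that the Erlang VM provides.

```python
from typing import Callable


class RestartLimitExceeded(Exception):
    """Raised when the task keeps crashing after the allowed number of restarts."""


def supervise(task: Callable[[], str], max_restarts: int = 3) -> str:
    """Run `task`, restarting it from scratch each time it crashes."""
    restarts = 0
    while True:
        try:
            return task()
        except Exception as exc:
            # No attempt is made to repair the failed run; it is simply started again.
            if restarts >= max_restarts:
                raise RestartLimitExceeded(f"gave up after {restarts} restarts") from exc
            restarts += 1


def make_flaky_worker(failures_before_success: int) -> Callable[[], str]:
    """Build a worker that crashes a fixed number of times, then succeeds.

    The counter stands in for a transient fault in the environment.
    """
    state = {"calls": 0}

    def worker() -> str:
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("simulated crash")
        return f"ok after {state['calls']} attempts"

    return worker


print(supervise(make_flaky_worker(2)))
```

The worker contains no error handling at all. Recovery is the supervisor's job, which is the separation that OTP encourages.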

Then there are tools that can help reliability:

Saboteur (https://github.com/tomakehurst/saboteur) is a tool that can inject network failures between parts of the system. This allows delays and blocks to be simulated. Systems can be tested for resilience – how they behave when the network fails and then recovers.
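The kind of behaviour such a test looks for can be sketched in-process. The Python below does not use Saboteur or its API; `FaultyLink` and `fetch_with_fallback` are hypothetical stand-ins that block a call, check that the caller degrades gracefully, and then check that it recovers once the fault is lifted.

```python
from typing import Callable, Dict, Tuple


class NetworkDown(Exception):
    """Raised while an injected fault is active."""


class FaultyLink:
    """Wraps a remote call and fails it on demand, as a stand-in for a blocked network."""

    def __init__(self, target: Callable[[str], str]) -> None:
        self._target = target
        self.blocked = False

    def call(self, key: str) -> str:
        if self.blocked:
            raise NetworkDown("injected fault")
        return self._target(key)


def fetch_with_fallback(link: FaultyLink, key: str, cache: Dict[str, str]) -> Tuple[str, str]:
    """Return (value, source), serving the last known value while the link is down."""
    try:
        value = link.call(key)
    except NetworkDown:
        if key in cache:
            return cache[key], "cached"
        raise
    cache[key] = value
    return value, "live"


link = FaultyLink(lambda key: key.upper())  # hypothetical remote service
cache: Dict[str, str] = {}

before = fetch_with_fallback(link, "a", cache)  # network healthy
link.blocked = True
during = fetch_with_fallback(link, "a", cache)  # fault injected
link.blocked = False
after = fetch_with_fallback(link, "a", cache)   # network recovered
print(before, during, after)
```

A tool that works at the network level exercises the real sockets, timeouts and retries, which an in-process stand-in like this cannot do.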

Gatling (https://gatling.io/) is a load testing tool that shows how a system reacts under load. One place that I worked tested to either three times the peak load or to system failure. This involved having twelve instances of the application installed on AWS VMs around the world. Some or all of these could be pointed at a system with a suite of scenarios. A run would have 300 users arriving per second (each of whom then used the system) for 2 hours. A good test run would leave the system responsive throughout the load and recovering cleanly afterwards.
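To put those numbers in perspective, here is the arithmetic as a few lines of Python. It assumes that the 300 arrivals per second is the total across the whole test and that all twelve instances share it evenly; neither assumption is guaranteed, since only some of the instances might be used in a given run.

```python
ARRIVALS_PER_SECOND = 300  # new users starting a scenario each second
DURATION_HOURS = 2
INSTANCES = 12             # assumes every load generator takes part

duration_seconds = DURATION_HOURS * 60 * 60
total_arrivals = ARRIVALS_PER_SECOND * duration_seconds
per_instance_rate = ARRIVALS_PER_SECOND / INSTANCES

print(f"{total_arrivals:,} users over {duration_seconds:,} seconds")
print(f"{per_instance_rate:g} arrivals per second per instance")
```

That comes to a little over two million simulated users in a single run.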

Property testing (https://github.com/proper-testing/proper or http://hackage.haskell.org/package/QuickCheck). These tools test a system against generators that try to explore the entire parameter space. Suites of random values are tested and, upon a failure, the tool attempts to find the simplest example that causes the same problem. This is documented here: https://pragprog.com/book/fhproper/property-based-testing-with-proper-erlang-and-elixir . Note that the book is still in beta.
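The generate-then-simplify loop can be sketched with nothing but the Python standard library. This is not the PropEr or QuickCheck API; `check_property`, `minimise`, `gen_int_list` and `shrink_int_list` are invented for this example, and the real tools have far better generators and shrinkers. The property tested here, that reversing a list leaves it unchanged, is deliberately false so that there is something to find.

```python
import random
from typing import Callable, Iterator, List, Optional


def gen_int_list(rng: random.Random) -> List[int]:
    """Generate a random list of up to ten small integers."""
    return [rng.randint(-100, 100) for _ in range(rng.randint(0, 10))]


def shrink_int_list(xs: List[int]) -> Iterator[List[int]]:
    """Yield slightly simpler lists: one element dropped, or one element halved toward zero."""
    for i in range(len(xs)):
        yield xs[:i] + xs[i + 1:]
    for i, x in enumerate(xs):
        if x != 0:
            halved = x // 2 if x > 0 else -((-x) // 2)
            yield xs[:i] + [halved] + xs[i + 1:]


def minimise(prop: Callable[[List[int]], bool], case: List[int]) -> List[int]:
    """Keep taking the first simpler list that still fails, until none does."""
    improved = True
    while improved:
        improved = False
        for smaller in shrink_int_list(case):
            if not prop(smaller):
                case = smaller
                improved = True
                break
    return case


def check_property(prop: Callable[[List[int]], bool], runs: int = 200, seed: int = 0) -> Optional[List[int]]:
    """Return a simplified counterexample, or None if every random case passed."""
    rng = random.Random(seed)
    for _ in range(runs):
        case = gen_int_list(rng)
        if not prop(case):
            return minimise(prop, case)
    return None


def reverse_is_identity(xs: List[int]) -> bool:
    # A false claim: it only holds for palindromes.
    return list(reversed(xs)) == xs


counterexample = check_property(reverse_is_identity)
print("simplest failing case found:", counterexample)
```

The reported failure is usually far easier to reason about than the random list that first triggered it, which is much of the appeal of these tools.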

The resources are out there; it is up to the software development community to use them to raise its game.

 
