Software Reliability

One of my colleagues asked at a recent open spaces event whether the software industry was doing enough about software reliability. The question was inspired by Bob Martin's talk: https://www.youtube.com/watch?v=17vTLSkXTOo The talk raised the fear of 10K deaths caused by software and of potential future restrictions.

To start with, I mentioned the aircraft practice of having three independent teams write distinct solutions, at least two of which must agree for any set of inputs.
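As a rough illustration of the two-out-of-three idea, here is a minimal sketch. The Voter module and the three "team" functions are hypothetical names made up for this example; real avionics voting is far more involved.

```elixir
defmodule Voter do
  # Runs three independently written implementations on the same input.
  # Returns {:ok, value} when at least two results agree,
  # or {:error, :no_majority} when all three disagree.
  def vote(implementations, input) when length(implementations) == 3 do
    results = Enum.map(implementations, fn impl -> impl.(input) end)

    # Enum.frequencies/1 (Elixir 1.10+) counts how often each result occurs.
    case Enum.find(Enum.frequencies(results), fn {_value, count} -> count >= 2 end) do
      {value, _count} -> {:ok, value}
      nil -> {:error, :no_majority}
    end
  end
end

# Three hypothetical implementations of "double a number"; team_c has a bug.
team_a = fn x -> x * 2 end
team_b = fn x -> x + x end
team_c = fn x -> x * x end

IO.inspect(Voter.vote([team_a, team_b, team_c], 3))
```

With an input of 3 the first two teams return 6 and the buggy one returns 9, so the vote is {:ok, 6}.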

The next item is the Erlang/Elixir/OTP ecosystem with its supervisor trees. This is the Erlang principle of “let it crash”. Erlang was designed to allow the software to fail and to expect the machine it runs on to fail. This is why the Erlang VM is designed to be distributed – it’s the only way to protect against failure. It even allows software to be upgraded while running. This is the software system that runs telephone switches, Heroku, RabbitMQ and WhatsApp.

Then there are tools that can help reliability:

Saboteur (https://github.com/tomakehurst/saboteur) is a tool that can inject network failures between parts of the system. This allows delays and blocks to be simulated. Systems can be tested for resilience – how they behave when the network fails and then recovers.

Gatling (https://gatling.io/) is a load test tool. This allows us to see how a system reacts under load. One place that I worked tested to either three times the peak load or to system failure. This involved having twelve instances of the application installed in AWS VMs around the world. Some or all of these could be pointed at a system with a suite of scenarios. A run would have 300 users arriving per second (and then using the system) for 2 hours. A good test run would involve the system still being responsive throughout this load and then cleanly recovering afterwards.

Property Testing (https://github.com/proper-testing/proper or http://hackage.haskell.org/package/QuickCheck). These are tools that test a system against generators that try to exercise its behaviour across the whole of the parameter space. Suites of random values are tested, and upon a failure the tool attempts to find the simplest example that causes the same problem. This is documented here: https://pragprog.com/book/fhproper/property-based-testing-with-proper-erlang-and-elixir . Note that the book is still in beta.
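To make the generate-then-shrink idea concrete, here is a hand-rolled sketch. It deliberately uses neither PropEr nor QuickCheck: TinyProp and its functions are hypothetical names, and the range step syntax assumes Elixir 1.12 or later. Real libraries have much smarter generators and shrinkers.

```elixir
defmodule TinyProp do
  # Generates a random list of 0..20 integers, each in -100..100.
  def random_list do
    size = :rand.uniform(21) - 1
    for _ <- 1..size//1, do: :rand.uniform(201) - 101
  end

  # Checks `property` against `runs` random lists.
  # Returns :ok, or {:failed, counterexample} with a shrunk failing list.
  def check(property, runs \\ 100) do
    1..runs
    |> Enum.map(fn _ -> random_list() end)
    |> Enum.find(fn list -> not property.(list) end)
    |> case do
      nil -> :ok
      failing -> {:failed, shrink(failing, property)}
    end
  end

  # Keeps dropping one element at a time for as long as the property
  # still fails, so the result is a locally minimal counterexample.
  defp shrink(list, property) do
    smaller =
      0..(length(list) - 1)//1
      |> Enum.map(&List.delete_at(list, &1))
      |> Enum.find(fn candidate -> not property.(candidate) end)

    if smaller, do: shrink(smaller, property), else: list
  end
end

# A property that holds: reversing a list twice gives back the original.
IO.inspect(TinyProp.check(fn list -> Enum.reverse(Enum.reverse(list)) == list end))

# A property that is false: "every list has fewer than three elements".
IO.inspect(TinyProp.check(fn list -> length(list) < 3 end))
```

The second check should report a failure whose counterexample has been shrunk to exactly three elements; the values themselves are random.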

The resources are out there; it is up to the software development community to use them to raise its game.

 

Elixir Supervisor Introduction

This is a simple introduction to OTP in Elixir.

The sample is based heavily upon Introducing Elixir.

This will demonstrate how Elixir keeps a server alive across code recompilation, and restores a server after it has crashed.

This assumes that you have installed Elixir.

On a Mac you can install it with (there are other solutions on other platforms):

brew install elixir

Start by cloning the repo:

git clone https://github.com/chriseyre2000/drop_server.git

Change directory into the drop_server folder.

Type

iex -S mix

This will start the application running inside the interactive Elixir REPL.

This application has a DropServer.Worker being monitored by a supervisor.
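For readers who have not seen one, a supervised worker of this shape might look roughly like the sketch below. This is not the code from the repo: Sketch.Worker is a hypothetical module, the {:ok, velocity} reply shape is my assumption, and the formula is inferred from the :math.sqrt(-19.6) line in the crash trace later on.

```elixir
defmodule Sketch.Worker do
  use GenServer

  # Registered under the module name so callers do not need the pid.
  def start_link(_opts), do: GenServer.start_link(__MODULE__, 0, name: __MODULE__)

  def calculate_drop(distance), do: GenServer.call(__MODULE__, {:drop, distance})
  def how_many_calls, do: GenServer.call(__MODULE__, :count)

  @impl true
  def init(count), do: {:ok, count}

  @impl true
  def handle_call({:drop, distance}, _from, count) do
    # :math.sqrt/1 raises ArithmeticError for a negative argument.
    # There is deliberately no guard here: the process is allowed to crash.
    velocity = :math.sqrt(2 * 9.8 * distance)
    {:reply, {:ok, velocity}, count + 1}
  end

  def handle_call(:count, _from, count), do: {:reply, count, count}
end

# A :one_for_one supervisor restarts only the child that died.
{:ok, _supervisor} = Supervisor.start_link([Sketch.Worker], strategy: :one_for_one)

IO.inspect(Sketch.Worker.calculate_drop(40))
IO.inspect(Sketch.Worker.how_many_calls())
```

The state here is just the call count, which is why a restart resets it to zero.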

First you can calculate the velocity after falling a distance in meters:

(Type the bit after the iex> prompts):

iex(1)> DropServer.Worker.calculate_drop(40)

{"ok", 28.0}

iex(2)> DropServer.Worker.calculate_drop(41)

{"ok", 28.347839423843222}

iex(3)> DropServer.Worker.calculate_drop(42)

{"ok", 28.691462144686877}

iex(4)> DropServer.Worker.how_many_calls

So far calculated 3 velocities.

:ok

You can even recompile the DropServer.Worker while it is running:

iex(5)> c("lib/drop_server/drop_server.ex")

warning: redefining module DropServer.Worker (current version loaded from _build/dev/lib/drop_server/ebin/Elixir.DropServer.Worker.beam)

lib/drop_server/drop_server.ex:1

warning: redefining module DropServer.Worker.State (current version loaded from _build/dev/lib/drop_server/ebin/Elixir.DropServer.Worker.State.beam)

lib/drop_server/drop_server.ex:4

[DropServer.Worker, DropServer.Worker.State]

This even maintains the state:

iex(6)> DropServer.Worker.how_many_calls

So far calculated 3 velocities.

:ok

Now if you give it some invalid data, Elixir will do the classic Erlang thing and let it crash!

iex(7)> DropServer.Worker.calculate_drop(-1)

21:49:21.278 [error] GenServer DropServer.Worker terminating
** (ArithmeticError) bad argument in arithmetic expression
    (stdlib) :math.sqrt(-19.6)
    (drop_server) lib/drop_server/drop_server.ex:44: DropServer.Worker.fall_velocity/1
    (drop_server) lib/drop_server/drop_server.ex:20: DropServer.Worker.handle_call/3
    (stdlib) gen_server.erl:661: :gen_server.try_handle_call/4
    (stdlib) gen_server.erl:690: :gen_server.handle_msg/6
    (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message (from #PID<0.133.0>): -1
State: %DropServer.Worker.State{count: 3}
Client #PID<0.133.0> is alive
    (stdlib) gen.erl:169: :gen.do_call/4
    (elixir) lib/gen_server.ex:921: GenServer.call/3
    (stdlib) erl_eval.erl:680: :erl_eval.do_apply/6
    (elixir) src/elixir.erl:265: :elixir.eval_forms/4
    (iex) lib/iex/evaluator.ex:249: IEx.Evaluator.handle_eval/5
    (iex) lib/iex/evaluator.ex:229: IEx.Evaluator.do_eval/3
    (iex) lib/iex/evaluator.ex:207: IEx.Evaluator.eval/3
    (iex) lib/iex/evaluator.ex:94: IEx.Evaluator.loop/1
** (exit) exited in: GenServer.call(DropServer.Worker, -1, 5000)
    ** (EXIT) an exception was raised:
        ** (ArithmeticError) bad argument in arithmetic expression
            (stdlib) :math.sqrt(-19.6)
            (drop_server) lib/drop_server/drop_server.ex:44: DropServer.Worker.fall_velocity/1
            (drop_server) lib/drop_server/drop_server.ex:20: DropServer.Worker.handle_call/3
            (stdlib) gen_server.erl:661: :gen_server.try_handle_call/4
            (stdlib) gen_server.erl:690: :gen_server.handle_msg/6
            (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
    (elixir) lib/gen_server.ex:924: GenServer.call/3

However, only the state is lost:

iex(7)> DropServer.Worker.how_many_calls

So far calculated 0 velocities.

:ok

iex(8)> DropServer.Worker.calculate_drop(10)

{"ok", 14.0}

iex(9)> DropServer.Worker.how_many_calls

So far calculated 1 velocities.

:ok

The supervisor has restarted the server.

In a real service the state would have been kept in a distinct service, possibly with some form of persistence.
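One way to picture keeping the state elsewhere is to hold the counter in a separate process under the same supervisor, so that a worker crash does not touch it. This is only a sketch under my own assumptions: Sketch.Counter and Sketch.DropWorker are hypothetical modules, not part of the repo, and an in-memory Agent is still not persistence.

```elixir
defmodule Sketch.Counter do
  use Agent

  # Holds the call count in its own process, separate from the worker.
  def start_link(_opts), do: Agent.start_link(fn -> 0 end, name: __MODULE__)
  def increment, do: Agent.update(__MODULE__, &(&1 + 1))
  def value, do: Agent.get(__MODULE__, & &1)
end

defmodule Sketch.DropWorker do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)
  def calculate_drop(distance), do: GenServer.call(__MODULE__, distance)

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call(distance, _from, state) do
    # Crashes on a negative distance, before the counter is touched.
    velocity = :math.sqrt(2 * 9.8 * distance)
    Sketch.Counter.increment()
    {:reply, {:ok, velocity}, state}
  end
end

# With :one_for_one, a worker crash restarts the worker but not the counter.
children = [Sketch.Counter, Sketch.DropWorker]
{:ok, _supervisor} = Supervisor.start_link(children, strategy: :one_for_one)

{:ok, _velocity} = Sketch.DropWorker.calculate_drop(40)

# The caller of a crashing GenServer.call sees an exit, so catch it here.
try do
  Sketch.DropWorker.calculate_drop(-1)
catch
  :exit, _reason -> :worker_crashed
end

# Give the supervisor a moment to restart the worker.
Process.sleep(100)

IO.inspect(Sketch.Counter.value())
```

After the crash the counter still reports the one successful call, and the restarted worker accepts new requests.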

80/20 Principle and Optimizing

The following finds the total size of a folder and its contents on a Mac:

du -s

This came in handy when I was trying to speed up a website deploy.

It turned out that over 98% of the content of the site was in one folder. Splitting this off and changing the copy policy saved between 50% and 75% of the deploy time.

Look for the quick wins!

Elixir Metaprogramming: the basics

Elixir has a very small core language. Most of what is thought of as the syntax is actually written using macros.

Elixir makes it ridiculously easy to get at the Abstract Syntax Tree (AST) of the code that you are using.

How to get the AST of some code:

iex(1)> quote do: 1 + 1
{:+, [context: Elixir, import: Kernel], [1, 1]}
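A few more things can be done with a quoted expression using only the standard library. The snippet below is a small sketch; the variable names are mine, and the metadata list in the middle of the tuple varies between Elixir versions, so it is matched with an underscore.

```elixir
# Quote an expression to get its AST as a three-element tuple.
ast = quote do: 1 + 1
{:+, _metadata, [1, 1]} = ast

# Turn the AST back into source text.
IO.puts(Macro.to_string(ast))

# Evaluate the AST; the result comes back with the variable bindings.
{result, _bindings} = Code.eval_quoted(ast)
IO.inspect(result)

# unquote/1 injects a value from the surrounding code into the AST.
x = 5
injected = quote do: unquote(x) * 2
{:*, _meta, [5, 2]} = injected
{ten, _} = Code.eval_quoted(injected)
IO.inspect(ten)
```

This prints 1 + 1, then 2, then 10.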

New Machine Setup

I have recently moved jobs. This comes with the inevitable machine setup issue.

This is a list of things that I have been installing on my machine so that I won’t have to build myself another list:

  • Chrome
  • homebrew
  • neo4j
  • java8
  • elixir
  • exercism
  • vscode
    • Enable autosave (on the file menu)
    • GitLens
    • IntelliJ IDEA Keybindings
    • JS Refactor
  • Gradle
  • git ssh setup

After every macOS upgrade:

xcode-select --install

Development Practices That I Want To Use In Future Roles

I have just left one company for a new role. These are a few notes that I would like to bring forward to future companies:

  • Empower the devs to add cloud resources within a limit, say $100 per month, without having to ask. The meetings to request this cost more than they save. Dev teams will spend less time waiting.
  • Have a bell that can be used to trigger team timeouts. Hold the discussions as soon as they are needed. Decisions should be made by the team. Go one better than us and actually write up each decision, including the reasons and assumptions.
  • When adding a new piece of technology, always consider how to replace it. Cloud services do stop with anything between 1 and 6 months' notice.
  • Actively monitor the cloud infrastructure bills, both current and projected. Warn the team daily if a threshold is breached.
  • Never let the build stay broken. Tests may only be ignored for a fixed number of days, and only because of infrastructure failures.
  • Deploy frequently. Weekly at a minimum. Code not deployed will rot. Use feature switches if possible (but ruthlessly remove them after a feature is live). This allows deploying from master.
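For the feature switch point, the simplest form is a lookup that guards the new code path. This is a minimal sketch under my own assumptions: Features, Checkout and the switch names are hypothetical, and a real system would more likely read the switches from configuration than from a module attribute.

```elixir
defmodule Features do
  # Hypothetical switch table; unfinished work stays off in production.
  @switches %{new_checkout: false, faster_search: true}

  # Unknown switches default to off.
  def enabled?(name), do: Map.get(@switches, name, false)
end

defmodule Checkout do
  # Both code paths live on master; the switch decides which one runs.
  def run do
    if Features.enabled?(:new_checkout), do: :new_flow, else: :old_flow
  end
end

IO.inspect(Checkout.run())
```

Once the feature is live, the switch and the old branch of the if should both be deleted.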

Script to help review Exercism.io Elixir


@rem fix a.test.exs
@rem removes the pending tag from the test and adds _2 to filename
@rem the dot is escaped so that sed only matches a literal ".exs"
@echo %1 | sed -e "s/\.exs/_2.exs/" > temp.txt
@set /p Filename=<temp.txt
@cat %1 | grep -v "@tag :pending" > %Filename%
@rm temp.txt
elixir %Filename%


fix_test.bat


Or on a Mac/Linux:

grep -v @tag *_test.exs > test.exs && elixir test.exs
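The same idea can also be written in Elixir itself, which avoids needing sed and grep on Windows. This is a sketch rather than a tested replacement for the scripts above: FixTest is a hypothetical module name, and bob_test.exs in the usage note is a made-up file name.

```elixir
defmodule FixTest do
  # Returns the source with every line containing "@tag :pending" removed.
  def strip_pending(source) do
    source
    |> String.split("\n")
    |> Enum.reject(&String.contains?(&1, "@tag :pending"))
    |> Enum.join("\n")
  end

  # Writes the stripped copy next to the original, adding _2 to the
  # file name as the batch script does, and returns the new path.
  def fix_file(path) do
    target = String.replace_suffix(path, ".exs", "_2.exs")
    File.write!(target, strip_pending(File.read!(path)))
    target
  end
end
```

Calling FixTest.fix_file("bob_test.exs") would write bob_test_2.exs, which can then be run with elixir bob_test_2.exs.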