Definition of Metastable Failure

Here is the paper that defined the Metastable Failure: https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf

A metastable failure is a failure mode in which the system stays in a degraded state even after the trigger that caused it has been removed.

A simple example is an unlimited linear retry strategy. A naive approach to retries is to repeat the call after a short, fixed pause. Under normal load this helps with the occasional problem. Under extreme load, when the server is failing because of excessive requests, the retry policy makes things worse. Assuming that the overload point is 100 requests per second and that every request fails, a one-second retry interval produces 200 requests in second 2: the 100 new requests plus 100 retries.

This is what happens:

[Figure: 100 requests per second with an unbounded retry policy]
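
The growth can be sketched with a small simulation. It assumes, purely for illustration, that every request fails while the server is overloaded, that each failed request is retried exactly one second later, and that no client ever gives up. The function name and its parameters are hypothetical.

```python
def requests_per_second(new_per_second: int, seconds: int) -> list[int]:
    """Total requests hitting the server in each second.

    Assumes every request fails and is retried one second later, forever.
    """
    load = []
    outstanding = 0  # failed requests waiting to be retried
    for _ in range(seconds):
        total = new_per_second + outstanding  # new traffic plus retries
        load.append(total)
        outstanding = total  # everything failed, so everything retries
    return load


print(requests_per_second(100, 5))  # [100, 200, 300, 400, 500]
```

Under these assumptions, even if new traffic stopped after second 5, the 500 outstanding requests would keep being retried every second. The load outlives whatever caused the original spike.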

A breaking load becomes persistent. This is why retry policies need to have exponential backoff and a means of giving up.
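
A minimal sketch of such a policy follows. The function name, parameter names and default values are assumptions chosen for illustration. The random jitter is a common addition that the text above does not mention. It spreads retries from many clients so that they do not all arrive at the same moment.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Call operation(), retrying on exception with exponential backoff.

    Gives up and re-raises the last error after max_attempts failures.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # the means of giving up: stop adding load
            # The upper bound doubles on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Wait a random time up to that bound before trying again.
            sleep(random.uniform(0, delay))
```

The sleep parameter exists only so that the waiting can be replaced in tests. In normal use it can be left at its default.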

This is a very simple example; more complex systems can have many ways of getting into these problems.

Bad Interfaces

Some years ago I was working on the replacement version of a big product. We had some of the leading clients using the new version while most were still using the old version. There was a need for a new piece of functionality that would eventually be used by both systems but be introduced into the older system first (as clients get what they want).

To make life easier we asked the team building it to build the new utility as a distinct product that would be integrated via an API. This would have made moving it over simply a case of integrating the API into the new system.

We made the mistake of letting the team define the interface.

A year later when we came to integrate it we found the following contract:

interface ITransfer {
    string action(string input);
}

Needless to say, it was not a simple replacement job. Every call used the same interface, and the string was mapped internally to whatever the call needed to do. Sometimes it carried XML, other times a list of integers.

Lesson learned: watch out for overly generic contracts. They are as good as having no contract at all.
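
As a contrast, here is a sketch of the same idea in Python. The first protocol mirrors the generic contract above. The second shows what an explicit contract for a single operation might look like. The export operation, its request and result types, and the in-memory implementation are all hypothetical, since the real operations behind the interface are not described here.

```python
from dataclasses import dataclass
from typing import Protocol


class Transfer(Protocol):
    """Overly generic: the contract says nothing about what is accepted."""

    def action(self, input: str) -> str: ...


@dataclass(frozen=True)
class ExportRequest:  # hypothetical operation
    record_ids: list[int]


@dataclass(frozen=True)
class ExportResult:
    xml_document: str


class Exporter(Protocol):
    """Explicit: inputs and outputs are visible in the signature."""

    def export(self, request: ExportRequest) -> ExportResult: ...


class InMemoryExporter:
    """Toy implementation, only here to make the sketch runnable."""

    def export(self, request: ExportRequest) -> ExportResult:
        items = "".join(f"<record id='{i}'/>" for i in request.record_ids)
        return ExportResult(xml_document=f"<records>{items}</records>")


result = InMemoryExporter().export(ExportRequest(record_ids=[1, 2]))
print(result.xml_document)
```

With the explicit version, a caller can see from the types alone that the operation takes a list of integers and returns XML, instead of discovering it a year later.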

Two Ends of The Scale

A colleague posted an article (https://learnosity.com/not-invented-here-syndrome-explained/) about Not Invented Here Syndrome. This is the tendency for a group to prefer things that they have built versus those that others have created.

The contrasting position is We Must Do This Because Google Are Doing It (https://blog.bradfieldcs.com/you-are-not-google-84912cf44afb). This is the tendency to adopt techniques because they have been used successfully in other large organisations.

You need to know the context of a solution and judge whether it applies to your own.
Wardley Maps can help here; see Wardley Maps in Elvish (https://twitter.com/swardley/status/764147364605100032).

You don’t need to invent everything yourself (https://www.ted.com/talks/thomas_thwaites_how_i_built_a_toaster_from_scratch?language=en) nor can you typically acquire everything.


Enabling Apps Downloaded to MacOS

Frequently on a Mac you need to get a custom app installed. If you can use the App Store or Homebrew, that's great. However, some smaller apps are not on these platforms. The Mac has great security systems to protect you, but the assumption that everything is malicious can prevent the happy path from being easy.

Recently I resorted to using the following:

https://osxdaily.com/2019/02/13/fix-app-damaged-cant-be-opened-trash-error-mac/

After working through a checklist, it leads to the following command:

xattr -cr /path/to/application.app

This recursively clears the extended attributes on the app bundle, including the flag that indicates that the code has been downloaded.
Only do this if you trust the code being run (in my case it was an in-house tool written by the company that I am contracting at).
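
The -c option clears every extended attribute, not only the download flag. A narrower alternative is to delete just the quarantine attribute, com.apple.quarantine, with the -d option. The sketch below builds that command; the function names are hypothetical, and running it only makes sense on macOS.

```python
import subprocess

QUARANTINE_ATTR = "com.apple.quarantine"


def remove_quarantine_command(app_path: str) -> list[str]:
    """Build an xattr call that deletes only the quarantine attribute.

    -d deletes the named attribute, -r recurses into the app bundle.
    """
    return ["xattr", "-dr", QUARANTINE_ATTR, app_path]


def remove_quarantine(app_path: str) -> int:
    """Run the command and return xattr's exit status (macOS only)."""
    return subprocess.run(remove_quarantine_command(app_path)).returncode


print(" ".join(remove_quarantine_command("/path/to/application.app")))
```

The same warning applies: only do this for code that you trust.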

Reducing Stress by Taking Control

A few years ago I was finding myself annoyed by the lack of documentation at work. At the time I just complained and carried on. This raised my stress level and did not help anything.

I then decided to be the change that I wanted, and started writing documentation for things that I needed. Once you have a start you can ask questions and will find that others will help expand.

This applies beyond documentation. If I find something wrong that is put down to “that violates our policy” then I ask to see the policy in writing. It’s very common for there to be talk of a policy that does not exist.

If something is wrong or annoying, try to see what can be changed and how you can help to make it happen. Sometimes this is just a case of getting the right people to talk to each other.

The User Interface is not Your Domain Model

Back in 2012, in the Open Source Journal (2012 issue 4), Robert C Martin argued that the database is not the center of your application. Given that DDD is more prevalent now than it was then, I would like to make a further argument.

The main problem that some companies have in building a domain model is that they don’t understand the difference between their user interface and the model. This is a key point. For some applications these can be very similar. For others, however, there are processes that live beyond the UI and are not typically discussed outside of the development team. It is with this part that a clear, distinct Domain Model helps.
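
A small sketch may make the distinction concrete. The payment example, its class names and its states are all hypothetical. The form holds what the user types on the screen. The domain model also covers settlement and reconciliation, which happen after the user has left the screen and never appear in the UI.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum


@dataclass(frozen=True)
class PaymentForm:
    """What the user sees and types: strings straight from the screen."""
    recipient: str
    amount_text: str


class PaymentState(Enum):
    SUBMITTED = "submitted"
    SETTLED = "settled"
    RECONCILED = "reconciled"


@dataclass
class Payment:
    """Domain model: includes steps that never appear in the UI."""
    recipient: str
    amount_pence: int
    state: PaymentState = PaymentState.SUBMITTED

    def settle(self) -> None:
        if self.state is not PaymentState.SUBMITTED:
            raise ValueError("only a submitted payment can be settled")
        self.state = PaymentState.SETTLED

    def reconcile(self) -> None:
        if self.state is not PaymentState.SETTLED:
            raise ValueError("only a settled payment can be reconciled")
        self.state = PaymentState.RECONCILED


def payment_from_form(form: PaymentForm) -> Payment:
    """Translate UI input into the domain model at the boundary."""
    pounds = Decimal(form.amount_text)
    return Payment(recipient=form.recipient, amount_pence=int(pounds * 100))
```

If the form were used as the model, there would be nowhere obvious to put the settle and reconcile rules, because nothing on the screen corresponds to them.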

Something Strange, Something Familiar

In a recent presentation I was demonstrating a new proposed architecture. Part of the demo included workflows using the architecture based upon some documentation that the customer was familiar with.

I received feedback that this was effective.

To generalise: when introducing something new to someone, try to link it to something that they already know.