Definition of Metastable Failure

Here is the paper that defined the Metastable Failure: https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf

The definition is a system failure mode that will remain even if the trigger has been removed.

A simple example would be an unlimited linear retry strategy. A naive approach to retries is to repeat the call after a small pause. Under normal load this can help with an occasional problem. Under extreme load when the server is failing due to excessive requests the retry policy make things worse. Assuming that the overload is at 100 requests per second, a one second retry will generate 200 requests in second 2

This is what happens:

100 requests per second with an unbounded retry policy

A breaking load becomes persistent. This is why retry policies need to have exponential backoff and a means of giving up.

This is a very simple example, more complex systems can have many ways of getting into these problems.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s