Definition of Metastable Failure

The term was introduced in the HotOS 2021 paper "Metastable Failures in Distributed Systems" by Bronson et al.: https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf

A metastable failure is a system failure mode that persists even after the trigger that caused it has been removed.

A simple example is an unlimited linear retry strategy. A naive approach to retries is to repeat the call after a small pause. Under normal load this helps with the occasional problem. Under extreme load, when the server is failing because of excessive requests, the retry policy makes things worse. Assume the server is overloaded at 100 requests per second and every failed request is retried after one second. The server then sees 200 requests in second 2 (100 new requests plus 100 retries), 300 in second 3, and so on.

This is what happens:

[Figure: 100 requests per second with an unbounded retry policy]
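
The arithmetic can be checked with a small simulation. This is only a sketch under simplifying assumptions that are not from the paper: the function name `simulate` and the `overload_at` parameter are made up for illustration, every request fails in any second where the total load reaches the overload threshold, and every failed request is retried exactly one second later with no limit. The spike of 100 requests per second lasts two seconds, after which new traffic drops back to 50 per second.

```python
def simulate(arrivals, overload_at):
    """Return the total load the server sees in each second.

    arrivals    -- new (non-retry) requests arriving in each second
    overload_at -- hypothetical threshold: at or above this load,
                   every request in that second fails
    """
    totals = []
    pending_retries = 0
    for new_requests in arrivals:
        total = new_requests + pending_retries
        totals.append(total)
        # Unbounded retry policy: everything that failed comes back
        # one second later. Nothing is ever dropped.
        pending_retries = total if total >= overload_at else 0
    return totals


# Spike of 100 req/s for two seconds, then back to a normal 50 req/s.
loads = simulate([100, 100, 50, 50, 50, 50], overload_at=100)
print(loads)  # [100, 200, 250, 300, 350, 400]
```

The trigger is gone from second 3 onwards, yet the load keeps climbing because the retries alone are enough to keep the server above its threshold.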

A load that breaks the server becomes persistent: the retries keep the server overloaded even after the original spike has passed. This is why retry policies need exponential backoff and a means of giving up.
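
A minimal sketch of such a policy might look like the following. The helper name `call_with_backoff` and its parameters are hypothetical, chosen for illustration rather than taken from any library; `sleep` is injectable only so the behaviour can be checked without waiting.

```python
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Call operation(), retrying failures with exponential backoff.

    Gives up by re-raising the last error after max_attempts calls.
    All names and defaults here are illustrative assumptions.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Catching Exception is deliberately broad for the sketch;
            # real code would retry only errors known to be transient.
            if attempt == max_attempts - 1:
                raise  # the "means of giving up"
            # The delay doubles on every failure, capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

With `base_delay=1.0` and `max_attempts=4`, a call that always fails waits 1, 2 and 4 seconds between attempts and then raises, so each client contributes a shrinking amount of load instead of a constant stream of retries.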

This is a very simple example; more complex systems have many more ways of getting into these problems.


Published by chriseyre2000
