Here is the paper that defined the Metastable Failure: https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf
The definition is a system failure mode that will remain even if the trigger has been removed.
A simple example would be an unlimited linear retry strategy. A naive approach to retries is to repeat the call after a small pause. Under normal load this can help with an occasional problem. Under extreme load when the server is failing due to excessive requests the retry policy make things worse. Assuming that the overload is at 100 requests per second, a one second retry will generate 200 requests in second 2
This is what happens:
A breaking load becomes persistent. This is why retry policies need to have exponential backoff and a means of giving up.
This is a very simple example, more complex systems can have many ways of getting into these problems.