Good Retry, Bad Retry: An Incident Story

Sometimes, a seemingly simple and obvious solution can lead to a series of problems later on. This is especially true when adding retries.

Click to view the original at medium.com

Hasnain says:

Long, detailed read on distributed systems and retries. Worth reading and internalizing.

“A couple of months after the incident review was completed, the platform team adopted the retry budget technique with a threshold of 10%. Over the following year, there were other incidents, but no amplification from retries was observed.
Thanks to the incident review, Ben learned that “when encountering transient errors, add retries” is a risky approach. He gained an in-depth understanding of the risks involved, even with exponential backoff. He learned about exponential backoff and jitter techniques, Little’s Law and closed-loop systems, the concept of metastable failure state, the problem of retry amplification and techniques like retry circuit breaker and retry budget, as well as circuit breaker and deadline propagation mechanisms.
Ben now has an even more exciting journey ahead of him, delving deeper into retries. But that’s the subject of another post.”
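
For readers who have not met the retry budget mentioned in the quote, here is a minimal sketch of the idea: a client may retry only while retries stay at or below a fixed share of the requests it has sent. The excerpt gives the 10% threshold but not the platform team's implementation, so the class and method names below (`RetryBudget`, `record_request`, `try_spend_retry`) are hypothetical. The sketch also uses simple lifetime counters, whereas production systems usually track a sliding time window or a token bucket.

```python
class RetryBudget:
    """Illustrative retry budget: retries may not exceed a share of requests.

    Assumption: counters cover the whole lifetime of the object. A real
    client would normally count over a recent time window instead.
    """

    def __init__(self, max_retry_percent=10):
        self.max_retry_percent = max_retry_percent
        self.requests = 0  # first-attempt requests seen so far
        self.retries = 0   # retries granted so far

    def record_request(self):
        """Call once for every original (non-retry) request."""
        self.requests += 1

    def try_spend_retry(self):
        """Return True and count the retry if the budget still allows it."""
        # Integer arithmetic avoids floating-point rounding at the boundary.
        if (self.retries + 1) * 100 > self.max_retry_percent * self.requests:
            return False
        self.retries += 1
        return True


budget = RetryBudget(max_retry_percent=10)
for _ in range(100):
    budget.record_request()

# With 100 requests recorded, only 10 of 50 attempted retries are granted.
granted = sum(budget.try_spend_retry() for _ in range(50))
```

The point of the cap is that when a downstream service starts failing, the extra load from retries stays bounded (here, at most 10% on top of normal traffic) instead of multiplying with every failed call.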

Posted on 2024-10-08T07:19:19+0000