placeholder

How We Built a Self-Healing System to Survive a Terrifying Concurrency Bug At Netflix

Our CPUs were dying, the bug was temporarily un-fixable, and we had no viable path forward. Here's how we managed to survive.

Click to view the original at pushtoprod.substack.com

Hasnain says:

This was a great read

“I’ve always loved this incident for a few reasons:

This was a rare but brutal example of how writing non-thread-safe code can cripple your systems. There are a lot of problems you haven’t seen before because you’re not working on systems with sufficient volume to generate them.

The solution of automatically terminating random instances felt like a terrible engineering practice. But in the moment, it was the perfect solution to our problem.

Most importantly, we prioritized our own sanity”

Posted on 2024-11-13T08:08:24+0000