2023-03-08 Incident: Infrastructure connectivity issue affecting multiple regions

Between March 8, 2023, 06:03 UTC and March 9, 2023, 08:58 UTC, Datadog experienced an infrastructure connectivity issue that caused service degradation across multiple regions.

Click to view the original at datadoghq.com

Hasnain says:

Interesting technical read about a massive recent outage. It was a bit of a bummer that they only published it after being called out for their lack of transparency.

“Another common theme that all teams encountered during the service recovery was the fact that distributed, share-nothing data stores handle massive failures much better than most distributed data stores that require a quorum-based control plane. For example a fleet of independent data nodes with static shard assignment degrades roughly linearly as the number of nodes drops. The same fleet of data nodes, bound together by a quorum will operate without degradation so long as the quorum is met and refuse to operate once it’s not. Of course the quorum-based fleet is a lot easier to manage in the day-to-day, so going one way or another is not an obvious decision, but this outage highlights the need for us to re-examine past choices.”
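
To make the contrast in that quote concrete, here is a rough toy model of my own, not anything from Datadog's post. It assumes equal-sized shards, no replication, and a simple majority quorum; the function names and the 10-node fleet are made up for illustration.

```python
def share_nothing_capacity(total_nodes: int, healthy_nodes: int) -> float:
    """Fraction of data still served with static shard assignment.

    Assumption: every node owns an equal share of the shards and there is
    no replication, so capacity falls linearly as nodes are lost.
    """
    return healthy_nodes / total_nodes


def quorum_capacity(total_nodes: int, healthy_nodes: int) -> float:
    """Fraction of data still served by a quorum-bound fleet.

    Assumption: a simple majority quorum. The fleet runs without
    degradation while the quorum holds and refuses to operate otherwise.
    """
    quorum = total_nodes // 2 + 1
    return 1.0 if healthy_nodes >= quorum else 0.0


def degradation_table(total_nodes: int) -> list[tuple[int, float, float]]:
    """Rows of (healthy nodes, share-nothing capacity, quorum capacity)."""
    return [
        (
            healthy,
            share_nothing_capacity(total_nodes, healthy),
            quorum_capacity(total_nodes, healthy),
        )
        for healthy in range(total_nodes, -1, -1)
    ]


if __name__ == "__main__":
    # Hypothetical 10-node fleet losing one node at a time.
    for healthy, linear, quorum in degradation_table(10):
        print(f"{healthy:>2} healthy  share-nothing={linear:.0%}  quorum={quorum:.0%}")
```

In this sketch the share-nothing fleet loses a tenth of its capacity with each lost node, while the quorum fleet stays at full capacity until it drops below 6 healthy nodes and then serves nothing. Real systems with replication and rebalancing will behave less cleanly, but the cliff versus slope shape is the point.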

Posted on 2023-05-17T05:35:04+0000