Roblox Return to Service 10/28-10/31 2021 - Roblox Blog

Starting October 28th and fully resolving on October 31st, Roblox experienced a 73-hour outage. We’re sharing these technical details to give our community an understanding of the root cause of the problem, how we addressed it, and what we are doing to prevent similar issues from happening in the ...

Click to view the original at blog.roblox.com

Hasnain says:

This was a really well-written postmortem. It covers a lot of ground on issues that pop up when running distributed systems. From the low level (NUMA cache coherency) to the high level (latency, replication, rolling restarts, circular dependencies), it really has it all.

Worth reading for anyone trying to learn about systems at scale.

"Roblox’s core infrastructure runs in Roblox data centers. We deploy and manage our own hardware, as well as our own compute, storage, and networking systems on top of that hardware. The scale of our deployment is significant, with over 18,000 servers and 170,000 containers."

Posted on 2022-01-21T04:26:02+0000