Slack's Migration to a Cellular Architecture - Slack Engineering

Summary In recent years, cellular architectures have become increasingly popular for large online services as a way to increase redundancy and limit the blast radius of site failures. In pursuit of these goals, we have migrated the most critical user-facing services at Slack from a monolithic to a c...

Click to view the original at

Hasnain says:

Good, albeit short piece. Can’t wait for others in the series.

“A naive implementation that fits these requirements would have us plumb a signal into each of our RPC clients that, when received, causes them to fail a specified percentage of traffic away from a particular AZ. This turns out to have a lot of complexity lurking within. Slack does not share a common codebase or even runtime; services in the user-facing request path are written in Hack, Go, Java, and C++. This would necessitate a separate implementation in each language. Beyond that concern, we support a number of internal service discovery interfaces including the Envoy xDS API, the Consul API, and even DNS. Notably, DNS does not offer an abstraction for something like an AZ or partial draining; clients expect to resolve a DNS address and receive a list of IPs and no more. Finally, we rely heavily on open-source systems like Vitess, for which code-level changes present an unpleasant choice between maintaining an internal fork and doing the additional work to get changes merged into upstream.”

Posted on 2023-08-29T03:55:14+0000