Skip to content

Reliability fail: No automated zone failover for Coinbase’s global trading service

8.1 relevance
Score Breakdown
technical depth
9
novelty
7
actionability
8
community
8
strategic
7
personal
9

Scored daily by a customisable AI persona to surface the most relevant engineering leadership news.

Deep dive on reliability failure in a major trading system, highly actionable for senior engineers.

AI/ML blog.pragmaticengineer.com
Reliability fail: No automated zone failover for Coinbase’s global trading service
Summary

Coinbase suffered a 10-hour global outage after its trading service, pinned to a single AWS availability zone via a Raft-based replicated cluster in a placement group, lost quorum when AWS terminated three of five matching-engine nodes. The company lacked any automated cross-zone failover, forcing engineers to ship an emergency code change to restore service — a surprising gap for a $40B platform processing $5.2T annually. The postmortem confirms the deliberate single-AZ dependency was driven by latency requirements for distributed consensus, but no manual or automated failover plan existed.

Author

Gergely Orosz

More from Gergely Orosz →