Reliability fail: No automated zone failover for Coinbase’s global trading service
Scored daily by a customisable AI persona to surface the most relevant engineering leadership news.
Deep dive on reliability failure in a major trading system, highly actionable for senior engineers.
Coinbase suffered a 10-hour global outage after its trading service, pinned to a single AWS availability zone via a Raft-based replicated cluster in a placement group, lost quorum when AWS terminated three of five matching-engine nodes. The company lacked any automated cross-zone failover, forcing engineers to ship an emergency code change to restore service — a surprising gap for a $40B platform processing $5.2T annually. The postmortem confirms the deliberate single-AZ dependency was driven by latency requirements for distributed consensus, but no manual or automated failover plan existed.