Imagine you’re running a bank with branches in New York, London, and Tokyo. A customer has $1500 in their account. At the exact same moment, they walk into the New York branch asking to withdraw $1000, try to withdraw $1000 through the London ATM, and attempt another $1000 withdrawal via the Tokyo mobile app.
Here’s the challenge: each branch only sees part of the story. All three systems think the account has the full $1500 available. Without coordination, all three withdrawals could go through, overdrawing the account by $1500.
This is the distributed consensus problem: getting multiple computers to agree on a single truth, even when they can’t all talk to each other reliably. And there’s an algorithm that solves this challenge elegantly: Paxos.
In this post, we’ll explore how Paxos ensures that distributed systems stay consistent, even when networks fail and servers crash.
The Problem: Getting Everyone to Agree is Hard
Let’s start with a story that shows just how tricky consensus can be.
The Meeting Room Booking Nightmare
Imagine you work at a company with 5 offices around the world. Each office has a system that manages meeting room bookings. When someone in New York books Conference Room A for 2 PM, that booking needs to be synchronized across all offices.
Here’s what should happen:
- New York office receives booking request
- All 5 offices agree on the booking
- Booking is confirmed and stored everywhere
- No conflicts or double bookings
But what could go wrong?
graph TD
A[New York: Book Room A at 2 PM] --> B[Send to London]
A --> C[Send to Tokyo]
A --> D[Send to Sydney]
A --> E[Send to Berlin]
B --> F["London: <i class='fas fa-check-circle text-success'></i> Agree"]
C --> G["Tokyo: <i class='fas fa-times-circle text-danger'></i> Network Down"]
D --> H["Sydney: <i class='fas fa-check-circle text-success'></i> Agree"]
E --> I["Berlin: <i class='fas fa-times-circle text-danger'></i> Server Crashed"]
style G fill:#ffcccc
style I fill:#ffcccc
Now you have a problem:
- New York, London, and Sydney think the booking is confirmed
- Tokyo and Berlin have no idea about the booking
- Someone in Tokyo might book the same room for the same time
This is the distributed consensus problem.
Why Simple Solutions Don’t Work
You might think, “Just have one central server handle all bookings!” But that creates new problems:
Single Point of Failure: If that server goes down, nobody can book anything anywhere.
The Split Brain Problem: What if the central server loses connection to some offices but not others?
graph TD
A[Central Server] -.->|Network Cut| B[New York]
A -.->|Network Cut| C[London]
A --> D[Tokyo]
A --> E[Sydney]
A --> F[Berlin]
G[Backup Server] --> B
G --> C
style A fill:#ffcccc
style G fill:#ffe0b2
Now you have two servers both thinking they’re in charge. Tokyo books Conference Room A through the main server, while New York books the same room through the backup server. Chaos ensues.
The Solution: Paxos Explained
Paxos solves this by turning consensus into a democratic voting process. Instead of relying on one leader, it lets the servers vote on every decision.
Think of it like this analogy:
The Group Chat Analogy
Imagine you and 4 friends are trying to decide where to go for dinner using a group chat. Here’s how it works:
- Check Phase: Someone asks “Can we decide on dinner now?”
- Proposal Phase: Once enough people respond “yes”, they suggest “How about Pizza Palace?”
But there are rules to prevent chaos:
- Only one dinner decision at a time
- Once you've agreed to consider a newer request to decide, you can't go back and vote for an older suggestion
- You need majority agreement (3 out of 5) to confirm the restaurant
- If someone was offline, they see the final decision when they check their phone
That’s essentially how Paxos works.
How Paxos Works in Practice
Here’s what happens when New York wants to book Conference Room A:
sequenceDiagram
participant NY as New York
participant L as London
participant T as Tokyo
participant S as Sydney
participant B as Berlin
Note over NY: Want to book Room A at 2 PM
NY->>L: Prepare: Proposal #1
NY->>T: Prepare: Proposal #1
NY->>S: Prepare: Proposal #1
NY->>B: Prepare: Proposal #1
L-->>NY: Promise: OK, proposal #1
T--xNY: (Network down)
S-->>NY: Promise: OK, proposal #1
B-->>NY: Promise: OK, proposal #1
Note over NY: Got majority (3/5), proceed
NY->>L: Accept: Book Room A at 2 PM
NY->>S: Accept: Book Room A at 2 PM
NY->>B: Accept: Book Room A at 2 PM
L-->>NY: Accepted!
S-->>NY: Accepted!
B-->>NY: Accepted!
Even though Tokyo was down, the booking still went through because a majority (3 out of 5) agreed.
What happens when Tokyo comes back online?
When Tokyo’s network connection is restored, it discovers that Conference Room A has already been booked for 2 PM. Tokyo learns about this decision by checking with the other offices. Since the majority already agreed and the booking is finalized, Tokyo simply updates its records to match the consensus. This ensures all offices stay synchronized, even if some were temporarily offline during the decision-making process.
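If you like thinking in code, here's a tiny sketch of that catch-up step in Python. All the names (`catch_up`, `accepted_value`) are illustrative, not from a real library: the recovering office asks its peers what they accepted and adopts any value that a majority of the cluster reports.

```python
# Simplified, single-process sketch of a node learning a decision it missed.
# `peers` are objects exposing an illustrative `accepted_value` attribute.
from collections import Counter

def catch_up(peers, cluster_size):
    """Ask the other offices what they accepted; adopt any majority value."""
    majority = cluster_size // 2 + 1
    answers = [p.accepted_value for p in peers if p.accepted_value is not None]
    for value, votes in Counter(answers).most_common():
        if votes >= majority:
            return value   # e.g. "Room A booked at 2 PM"
    return None            # nothing has been decided yet
```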
Real-World Examples
Google Chubby: The Lock Service That Runs the Internet
Google uses Paxos in their Chubby lock service, which coordinates thousands of servers across their data centers. When you search on Google, Chubby helps ensure:
- Critical resources are properly coordinated across servers
- Search indexes stay consistent across data centers
- No two servers try to update the same data simultaneously
Without Paxos, Google’s search would be a mess of duplicate work and inconsistent results.
Apache Cassandra: Database Consensus
Cassandra implements the Paxos algorithm for lightweight transactions to ensure consistency. When you run:
INSERT INTO users (id, email) VALUES (123, 'john@example.com')
IF NOT EXISTS;
Cassandra uses Paxos to coordinate between nodes, ensuring that only one server can successfully insert that record, even if multiple clients attempt the same operation simultaneously across different data centers.
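For a feel of what this looks like from application code, here is a sketch using the DataStax Python driver (`cassandra-driver`). The contact point and keyspace name are placeholders for illustration, not anything Cassandra requires.

```python
# Sketch: running the lightweight transaction above from Python
# (pip install cassandra-driver). Host and keyspace are placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

row = session.execute(
    "INSERT INTO users (id, email) VALUES (123, 'john@example.com') IF NOT EXISTS"
).one()

# For a conditional (IF NOT EXISTS) statement, the first column of the
# returned row is the [applied] flag: True only if this insert won the
# Paxos round and actually wrote the record.
print("applied:", row[0])
```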
etcd: Kubernetes’ Brain
Kubernetes uses etcd to store cluster configuration, and etcd uses Raft (another consensus algorithm) for achieving agreement. Every time you deploy a new pod, decisions about cluster state and resource allocation are coordinated through consensus.
graph LR
A[kubectl deploy app] --> B[etcd cluster]
B --> C[Node 1]
B --> D[Node 2]
B --> E[Node 3]
style B fill:#e1f5fe
The Paxos Algorithm: Under the Hood
Now let’s dive into how Paxos actually works. Don’t worry, I’ll keep it simple.
The Two Phases of Paxos
Paxos has two main phases that repeat for every decision:
Phase 1: Prepare (The “May I?” Phase)
graph TD
A[Proposer] --> B["Send: Prepare(n)"]
B --> C[Acceptor 1]
B --> D[Acceptor 2]
B --> E[Acceptor 3]
C --> F["Promise(n) or Reject"]
D --> G["Promise(n) or Reject"]
E --> H["Promise(n) or Reject"]
style A fill:#e3f2fd
style C fill:#f3e5f5
style D fill:#f3e5f5
style E fill:#f3e5f5
The proposer asks: “May I make a proposal with number N?”
Acceptors respond: “Yes, I promise to ignore anything numbered lower than N”, but only if N is higher than any proposal number they’ve already promised. If an acceptor has already accepted a value from an earlier proposal, it reports that value in its promise, and the proposer must propose that value instead of its own.
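Here's a minimal sketch of an acceptor's side of Phase 1 in Python. The class and field names (`Acceptor`, `promised_n`, `accepted_n`, `accepted_value`) are illustrative, not taken from any particular implementation.

```python
# Single-process sketch of an acceptor's Prepare/Promise logic.
class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest proposal number we've promised
        self.accepted_n = -1        # number of the proposal we accepted, if any
        self.accepted_value = None  # value of the proposal we accepted, if any

    def on_prepare(self, n):
        """Phase 1: answer 'May I make proposal number n?'"""
        if n > self.promised_n:
            self.promised_n = n
            # Promise to ignore anything numbered below n, and report any
            # value we already accepted so the proposer must reuse it.
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", self.promised_n, None)
```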
Phase 2: Accept (The “Here’s My Proposal” Phase)
graph TD
A[Proposer] --> B["Send: Accept(n, value)"]
B --> C[Acceptor 1]
B --> D[Acceptor 2]
B --> E[Acceptor 3]
C --> F["Accepted or Rejected"]
D --> G["Accepted or Rejected"]
E --> H["Accepted or Rejected"]
style A fill:#e3f2fd
style C fill:#e8f5e8
style D fill:#e8f5e8
style E fill:#e8f5e8
The proposer says: “I propose value X with proposal number N.”
Acceptors respond: “Accepted”, unless they have already promised a higher-numbered proposal, in which case they reject it.
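Continuing the same sketch, here is the acceptor's Phase 2 handler plus a proposer that drives one full round. Again, every name is illustrative, and real Paxos sends these calls as network messages with retries and timeouts rather than local function calls.

```python
# Phase 2 handler, attached to the Acceptor sketch from Phase 1, plus a
# proposer that runs one complete round against a list of acceptors.
def on_accept(self, n, value):
    """Accept unless we already promised a higher-numbered proposal."""
    if n >= self.promised_n:
        self.promised_n = n
        self.accepted_n = n
        self.accepted_value = value
        return "accepted"
    return "rejected"

Acceptor.on_accept = on_accept  # add Phase 2 to the Phase 1 class above

def propose(acceptors, n, my_value):
    majority = len(acceptors) // 2 + 1

    # Phase 1: collect promises.
    replies = [a.on_prepare(n) for a in acceptors]
    promises = [r for r in replies if r[0] == "promise"]
    if len(promises) < majority:
        return None  # no majority; retry later with a higher proposal number

    # If any acceptor already accepted a value, adopt the one with the
    # highest accepted proposal number instead of our own value.
    highest = max(promises, key=lambda r: r[1])
    value = highest[2] if highest[1] >= 0 else my_value

    # Phase 2: ask the acceptors to accept that value.
    acks = sum(1 for a in acceptors if a.on_accept(n, value) == "accepted")
    return value if acks >= majority else None
```

Calling `propose([Acceptor() for _ in range(5)], n=1, my_value="Room A at 2 PM")` returns the booked value once three of the five acceptors say yes.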
Why the Proposal Numbers Matter
Think of proposal numbers like timestamps. Here’s why they’re important:
Imagine Alice and Bob both want to plan the group dinner at the same time:
- Alice asks the group: “Can we decide on dinner? (Proposal #1)”
- Bob asks the group: “Can we decide on dinner? (Proposal #2)”
- The group says “yes” to both, but Bob’s request came later (higher number)
- Alice suggests: “Let’s go to Pizza Palace (Proposal #1)”
- Bob suggests: “Let’s go to Burger Barn (Proposal #2)”
Since the group already committed to considering Bob’s newer request (#2), they ignore Alice’s older suggestion (#1) and go with Bob’s choice.
The rule is simple: Once you promise to consider a newer proposal, you can’t go back to an older one. This prevents confusion and ensures the group always moves forward with the most recent decision.
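One practical question: how do proposers pick numbers that always increase and never collide with another proposer's? A common scheme, sketched below with illustrative names, interleaves a local round counter with the server's ID.

```python
# Each server produces numbers that are unique to it and strictly increasing:
# server_id 2 of 5 yields 7, 12, 17, ...; server_id 4 yields 9, 14, 19, ...
class ProposalNumbers:
    def __init__(self, server_id, total_servers):
        assert 0 <= server_id < total_servers
        self.server_id = server_id
        self.total_servers = total_servers
        self.round = 0

    def next(self):
        """Return a fresh proposal number, higher than any issued before."""
        self.round += 1
        return self.round * self.total_servers + self.server_id
```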
The Tradeoffs: When Paxos Isn’t Perfect
Paxos is powerful, but it comes with costs:
Performance Impact
- Latency: Every decision needs multiple network round trips
- Throughput: Basic Paxos can only handle one decision at a time
- Network Traffic: Lots of messages flying around
Complexity Costs
- Implementation Difficulty: Paxos is notoriously hard to implement correctly
- Debugging Challenges: Consensus bugs are subtle and hard to reproduce
- Operational Overhead: Need to monitor proposal numbers, leader elections, etc.
Availability Limits
- Needs Majority: A majority of servers (more than N/2) must be up and reachable, so losing roughly half the cluster halts progress
- Network Partitions: A partition holding only a minority of servers can't make progress
- Split Brain: Careful design needed to prevent multiple leaders
graph TD
A[5 Servers] --> B[Network Split]
B --> C[Group 1: 2 servers]
B --> D[Group 2: 3 servers]
C --> E["Can't reach consensus<br/>(No majority)"]
D --> F["Can reach consensus<br/>(Has majority)"]
style E fill:#ffcdd2
style F fill:#c8e6c9
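The arithmetic behind that diagram is small enough to write down (the helper names below are illustrative):

```python
def quorum_size(n):
    """Smallest majority of an n-server cluster."""
    return n // 2 + 1

def can_make_progress(reachable, n):
    """A partition can only reach consensus if it still holds a majority."""
    return reachable >= quorum_size(n)

print(quorum_size(5))            # 3
print(can_make_progress(2, 5))   # False: the 2-server group is stuck
print(can_make_progress(3, 5))   # True:  the 3-server group keeps going
```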
When to Use Paxos: A Decision Framework
Use Paxos when:
- Strong consistency is non-negotiable
- You have multiple servers/data centers
- You can handle the implementation complexity
- Availability under failures is critical
Skip Paxos when:
- Eventually consistent is good enough
- You have very low latency requirements
- Simple single-master setup works
- Implementation resources are limited
Paxos vs. Other Consensus Algorithms
Raft: The Understandable Alternative
Raft was designed to be easier to understand than Paxos:
| Aspect | Paxos | Raft |
| --- | --- | --- |
| Complexity | High | Medium |
| Performance | Excellent | Good |
| Adoption | Widespread | Growing |
| Understanding | Difficult | Easier |
PBFT: For Byzantine Faults
Practical Byzantine Fault Tolerance (PBFT) handles malicious actors and arbitrary failures, while Paxos assumes nodes either work correctly or fail completely:
graph TD
A[Node Failures] --> B[Crash Failures]
A --> C[Byzantine Failures]
B --> D["Paxos/Raft<br/>Handles Well"]
C --> E["PBFT/Tendermint<br/>Required"]
style D fill:#c8e6c9
style E fill:#e1f5fe
Wrapping Up
Paxos is like democracy for distributed systems. It can be complex and sometimes slow, but it reliably achieves consensus even when networks fail and servers crash. While it’s challenging to implement correctly, it’s the foundation that keeps many of the world’s largest systems running smoothly.
Have you implemented Paxos in your systems? Share your experiences in the comments below!