In November 2022, 14 million Taylor Swift fans logged into Ticketmaster at the exact same time. The goal? Grab one of 2 million tickets for the Eras Tour.
Within minutes, the entire system collapsed. Fans saw spinning wheels, error messages, and phantom seats that disappeared the moment they clicked “Buy”. The crash was so catastrophic it triggered Congressional hearings and multiple lawsuits.
The shocking part? This wasn’t a DDoS attack from bots or hackers. It was legitimate fans, just trying to buy tickets.
Ticketmaster’s CEO later admitted: “The site couldn’t handle the demand.” But here’s what makes this fascinating for us developers - this is one of the hardest problems in distributed systems, disguised as a simple web form.
Think about it. You’re building a ticket booking system. How hard can it be? Just let people click a seat and pay, right?
Wrong. Dead wrong.
The Deceptively Simple Problem
Let me paint a picture. You’ve just launched your ticket booking platform. Opening night for the biggest movie of the year. 50,000 people are trying to book seats for the same 9 PM show that has only 300 seats.
Here’s what happens in the first 10 seconds:
- 1,000 people click seat A7 simultaneously
- 800 of them see “Seat selected! Proceed to payment”
- All 800 believe they own that seat
- 799 will get an error during payment
- 799 angry customers, lost sales, customer support nightmare
Now multiply this across hundreds of shows, thousands of theaters, millions of users. Welcome to your first week as a ticket booking system architect.
The real challenge isn’t just selling tickets. It’s selling tickets correctly when thousands of people want the exact same seat at the exact same millisecond.
Let me show you what actually happens in those 3 seconds between clicking “Buy” and seeing “Booking Confirmed”.
The Big Picture: What We’re Building
Before we dive into the chaos, let’s look at the complete system architecture. This is what powers platforms like BookMyShow, Fandango, and (on good days) Ticketmaster.
flowchart TD
User[User] --> Queue[Virtual Queue]
User --> CDN[CDN]
Queue --> LB[Load Balancer]
LB --> Search[Search Service]
LB --> Booking[Booking Service]
LB --> Payment[Payment Service]
Search --> ReadDB[(Read Replicas)]
Search --> Cache[(Redis Cache)]
Booking --> Lock[Distributed Lock]
Booking --> WriteDB[(Primary DB)]
Payment --> PGW[Payment Gateway]
Payment --> WriteDB
Payment --> Notify[Notifications]
WriteDB -.-> ReadDB
style Queue fill:#fff3cd,stroke:#333,stroke-width:2px
style Lock fill:#ffc107,stroke:#333,stroke-width:2px
style WriteDB fill:#90EE90,stroke:#333,stroke-width:2px
style ReadDB fill:#87CEEB,stroke:#333,stroke-width:2px
The key players:
- Virtual Queue: The bouncer. Controls who gets in and when. Prevents the Ticketmaster scenario.
- Search Service: Fast reads from replicas. Shows you what’s available.
- Booking Service: The critical path. Handles seat locks and reservations.
- Distributed Lock: The referee. Ensures only one person can claim a seat.
- Payment Service: Charges the card and confirms the booking.
Notice something crucial? Search is separated from booking. You can browse seats on read replicas, but the moment you click “Buy”, you hit the primary database with distributed locks. This separation is what allows systems to scale.
Now, let’s tackle the four impossible challenges that make or break ticket booking systems.
Challenge #1: The Race Condition From Hell
Picture this: It’s 10:00 AM sharp. Concert tickets go on sale. 50,000 people hit “Buy” for the same front-row seat at the exact same millisecond.
Here’s what happens without proper architecture:
%%{init: {'theme':'base', 'themeVariables': { 'actorBkg':'#E8F5E9','actorBorder':'#333','actorTextColor':'#333','fontSize':'16px'}}}%%
sequenceDiagram
participant U1 as User 1
participant U2 as User 2
participant U3 as User 3
participant S as Server
participant D as DB
Note over U1,D: 3 users click same time
U1->>S: SELECT A7
U2->>S: SELECT A7
U3->>S: SELECT A7
S->>D: Is A7 available?
D-->>S: Yes
S->>D: Is A7 available?
D-->>S: Yes
S->>D: Is A7 available?
D-->>S: Yes
Note over S,D: All see available
U1->>S: Book A7
U2->>S: Book A7
U3->>S: Book A7
S->>D: UPDATE booked
S->>D: UPDATE booked
S->>D: UPDATE booked
Note over U1,D: 3 bookings for 1 seat!
This is called a race condition. Multiple processes racing to modify the same resource. The result? Double bookings, triple bookings, angry customers, and you explaining to your CEO why you just oversold 2,000 seats.
Solution #1: Optimistic Locking (The Hopeful Approach)
The idea: “Let everyone try, but only let the winner succeed.”
Every row gets a version number. When you update, you check if the version is still the same. If someone else changed it, you fail.
graph LR
DB[(Seat A7
version: 5)]
DB --> U1[User 1
reads v5]
DB --> U2[User 2
reads v5]
U1 --> C1{Check
version}
U2 --> C2{Check
version}
C1 -->|Match| Win[Update to v6
BOOKED]
C2 -->|Changed| Fail[Version mismatch
RETRY]
style Win fill:#90EE90,stroke:#333,stroke-width:2px
style Fail fill:#FFB6B6,stroke:#333,stroke-width:2px
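In SQL, the whole trick is a single conditional UPDATE. Here's a minimal sketch, assuming a hypothetical `seats` table with `status` and `version` columns (SQLAlchemy Core is just one way to run it):

```python
# Optimistic locking sketch: the version check is one conditional UPDATE.
# Assumes a hypothetical seats(seat_id, show_id, status, version) table.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://localhost/tickets")  # placeholder connection string

def try_book_seat(seat_id: str, show_id: int, expected_version: int, user_id: str) -> bool:
    """True if we won the race; False if the version changed underneath us."""
    with engine.begin() as conn:
        result = conn.execute(
            text("""
                UPDATE seats
                SET status = 'BOOKED', booked_by = :user_id, version = version + 1
                WHERE seat_id = :seat_id AND show_id = :show_id
                  AND status = 'AVAILABLE' AND version = :expected_version
            """),
            {"user_id": user_id, "seat_id": seat_id,
             "show_id": show_id, "expected_version": expected_version},
        )
        # rowcount 1: our version matched and the seat is ours.
        # rowcount 0: someone else updated the row first; re-read and retry.
        return result.rowcount == 1
```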
When to use it:
- Low contention scenarios (movie bookings during off-peak hours)
- When retries are acceptable
- When reads vastly outnumber writes
When NOT to use it:
- High contention (concert tickets, IPL matches)
- Payment already processed (can’t just retry)
- User experience suffers from failed attempts
Solution #2: Pessimistic Locking (The Careful Approach)
The idea: “Lock the seat the moment someone selects it.”
When User 1 selects a seat, the database locks that row. No one else can lock or update it until User 1's transaction commits or rolls back.
%%{init: {'theme':'base', 'themeVariables': { 'actorBkg':'#E8F5E9','actorBorder':'#333','actorTextColor':'#333','fontSize':'16px'}}}%%
sequenceDiagram
participant U1 as User 1
participant U2 as User 2
participant D as DB
U1->>D: Lock row
Note over D: Locked
D-->>U1: OK
U2->>D: Lock row
Note over D: Waiting...
U1->>D: Update + commit
Note over D: Released
D-->>U2: OK
U2->>D: Check
D-->>U2: Booked
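In SQL terms, this is `SELECT ... FOR UPDATE` inside a transaction. A rough sketch against the same hypothetical `seats` table:

```python
# Pessimistic locking sketch: SELECT ... FOR UPDATE holds the row lock until commit/rollback.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://localhost/tickets")  # placeholder connection string

def book_seat_with_row_lock(seat_id: str, show_id: int, user_id: str) -> bool:
    with engine.begin() as conn:  # one transaction; the lock is released on commit or rollback
        row = conn.execute(
            text("""
                SELECT status FROM seats
                WHERE seat_id = :seat_id AND show_id = :show_id
                FOR UPDATE
            """),
            {"seat_id": seat_id, "show_id": show_id},
        ).fetchone()

        if row is None or row.status != "AVAILABLE":
            return False  # already booked (or no such seat)

        conn.execute(
            text("""
                UPDATE seats SET status = 'BOOKED', booked_by = :user_id
                WHERE seat_id = :seat_id AND show_id = :show_id
            """),
            {"user_id": user_id, "seat_id": seat_id, "show_id": show_id},
        )
        return True
```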
The problem: What if User 1’s payment fails? Or their browser crashes? Or they just walk away? That seat stays locked until the transaction times out.
When to use it:
- Medium contention scenarios
- Short transaction windows (under 30 seconds)
- Strong consistency requirements
The hidden cost: Database connections are expensive. Each locked seat ties up a connection for the life of the transaction. With 10,000 people holding seats mid-checkout, you'd need 10,000 open connections. Most databases max out at 500-1,000.
Solution #3: Distributed Locks (The Battle-Tested Approach)
This is what BookMyShow and Ticketmaster (eventually) used. The idea: “Use a fast, external system to manage locks.”
graph TB
U1[User 1] --> Lock[Redis Lock Manager]
U2[User 2] --> Lock
U3[User 3] --> Lock
Lock --> Key[Lock: seat A7
TTL: 5 min]
Key --> W[User 1
GRANTED]
Lock --> L1[User 2
DENIED]
Lock --> L2[User 3
DENIED]
style W fill:#90EE90,stroke:#333,stroke-width:2px
style L1 fill:#FFB6B6,stroke:#333,stroke-width:2px
style L2 fill:#FFB6B6,stroke:#333,stroke-width:2px
style Lock fill:#ffc107,stroke:#333,stroke-width:2px
How it works:
- User selects seat A7
- Server tries to acquire lock:
SET seat:show123:A7 user1_session EX 300 NX
- Redis returns success only to the first request
- Other users see “This seat is currently held by someone else”
- Lock expires after 5 minutes if payment isn’t completed
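In code, the acquire/release flow from the steps above looks roughly like this with redis-py. The key format mirrors the `seat:show123:A7` example; the compare-and-delete release script is a common pattern so you never delete someone else's lock after yours has expired:

```python
# Distributed seat lock sketch with redis-py.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

LOCK_TTL_SECONDS = 300  # 5-minute payment window

def acquire_seat_lock(show_id: str, seat: str, session_id: str) -> bool:
    # SET key value EX 300 NX -> True only for the first caller; everyone else gets None.
    return bool(r.set(f"seat:{show_id}:{seat}", session_id, ex=LOCK_TTL_SECONDS, nx=True))

# Release only if we still own the lock; the check-and-delete runs atomically in Lua.
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""
release_lock = r.register_script(RELEASE_SCRIPT)

def release_seat_lock(show_id: str, seat: str, session_id: str) -> None:
    release_lock(keys=[f"seat:{show_id}:{seat}"], args=[session_id])
```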
Why Redis?
- Blazing fast (sub-millisecond response)
- Atomic operations (multiple requests, one winner)
- Built-in TTL (automatic lock expiry)
- Can handle millions of locks simultaneously
The architecture:
flowchart TB
Select[User selects seat] --> TryLock[Try acquire lock]
TryLock --> Master[Redis Master]
Master --> CheckLock{Lock acquired?}
CheckLock -->|Yes| ShowPay[Show payment page]
CheckLock -->|No| ShowError[Seat unavailable]
ShowPay --> Process[Process payment]
Process --> Confirm[Confirm booking]
Confirm --> Release[Release lock]
Release --> Master
Master -.->|Replicate| Slave1[Redis Replica 1]
Master -.->|Replicate| Slave2[Redis Replica 2]
style ShowPay fill:#90EE90,stroke:#333,stroke-width:2px
style Confirm fill:#90EE90,stroke:#333,stroke-width:2px
style ShowError fill:#FFB6B6,stroke:#333,stroke-width:2px
style Master fill:#ffc107,stroke:#333,stroke-width:2px
When to use it:
- High contention (flash sales, popular events)
- Multiple application servers
- Need for automatic timeout/expiry
- Scalability is critical
Challenge #2: The Phantom Seat Problem
You’ve acquired the lock. The seat is yours for 5 minutes. Now what?
Here’s the tricky part: those 5 minutes need careful orchestration. This is where most systems fail.
%%{init: {'theme':'base', 'themeVariables': { 'actorBkg':'#E8F5E9','actorBorder':'#333','actorTextColor':'#333','fontSize':'16px'}}}%%
sequenceDiagram
participant U as User
participant A as App
participant L as Lock
participant D as DB
participant P as Payment
participant J as Job
U->>A: Select seat
A->>L: Acquire lock
L-->>A: OK
A->>D: Create reservation
D-->>A: ID: R123
A-->>U: Enter payment
Note over U,P: User enters details
alt Success
U->>P: Pay
P-->>A: Confirmed
A->>D: CONFIRMED
A->>L: Release
A-->>U: Booked!
else Failed
U->>P: Pay
P-->>A: Failed
A->>D: FAILED
A->>L: Release
A-->>U: Try again
else Timeout
Note over U: Closes browser
J->>D: Check expired
J->>D: Delete R123
J->>L: Release
Note over L: Available
end
The reservation lifecycle:
- Lock acquired: Seat is temporarily held (5 minutes)
- Temporary reservation: Database record with status “PENDING”
- Payment window: User has time to complete payment
- Three possible outcomes:
- Success: Payment clears, status becomes “CONFIRMED”
- Failure: Payment fails, lock released immediately
- Timeout: Background job cleans up after 5 minutes
Critical insight: Never trust the lock alone. Always maintain state in the database. Locks can fail, Redis can restart, but the database persists.
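Concretely, that means writing a PENDING reservation row (with its own expiry timestamp) the moment the lock is granted, so the database stays the source of truth even if Redis forgets. A sketch, assuming a hypothetical `reservations` table:

```python
# After acquiring the Redis lock, persist a PENDING reservation so the DB stays authoritative.
import uuid
from datetime import datetime, timedelta, timezone
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://localhost/tickets")  # placeholder connection string

def create_pending_reservation(show_id: int, seat_id: str, user_id: str) -> str:
    reservation_id = f"R{uuid.uuid4().hex[:8]}"
    expires_at = datetime.now(timezone.utc) + timedelta(minutes=5)  # mirrors the lock TTL
    with engine.begin() as conn:
        conn.execute(
            text("""
                INSERT INTO reservations (id, show_id, seat_id, user_id, status, expires_at)
                VALUES (:id, :show_id, :seat_id, :user_id, 'PENDING', :expires_at)
            """),
            {"id": reservation_id, "show_id": show_id, "seat_id": seat_id,
             "user_id": user_id, "expires_at": expires_at},
        )
    return reservation_id  # the payment flow later flips status to CONFIRMED or FAILED
```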
The Background Job Architecture
flowchart LR
Job[Background Job] --> Query[Find expired reservations]
Query --> Check{Payment done?}
Check -->|No| Delete[Delete reservation]
Delete --> Release[Release lock]
Release --> Notify[Notify user]
Check -->|Yes| Complete[Mark CONFIRMED]
style Complete fill:#90EE90,stroke:#333,stroke-width:2px
style Delete fill:#FFB6B6,stroke:#333,stroke-width:2px
style Check fill:#fff3cd,stroke:#333,stroke-width:2px
What can go wrong:
- User’s payment is processing: Background job deletes their reservation
- Network delay: Payment succeeded but confirmation never reached server
- Redis restart: Locks lost, chaos ensues
The solution: Two-phase confirmation with idempotency.
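One concrete piece of that: make the confirm step a conditional update, so retries are harmless and a reservation the cleanup job already expired can't be silently resurrected. A sketch against the same hypothetical `reservations` table:

```python
# Two-phase confirmation sketch: a conditional UPDATE makes "confirm" idempotent
# and safe to race against the cleanup job.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://localhost/tickets")  # placeholder connection string

def confirm_booking(reservation_id: str) -> bool:
    with engine.begin() as conn:
        result = conn.execute(
            text("""
                UPDATE reservations
                SET status = 'CONFIRMED'
                WHERE id = :id AND status IN ('PENDING', 'CONFIRMED')
            """),
            {"id": reservation_id},
        )
        confirmed = result.rowcount == 1
    # 0 rows means the cleanup job already expired this reservation:
    # refund the charge (or re-grant the seat) rather than dropping it silently.
    return confirmed
```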
Challenge #3: The Flash Sale Tsunami
Let’s talk about the elephant in the room - the Taylor Swift problem. How do you handle 14 million people trying to access your system simultaneously?
The naive approach: “Just add more servers!”
Reality check: Doesn’t work. Here’s why.
flowchart TB
M1[1M users] -->|All at once| LB1[Load Balancer]
M2[1M users] -->|All at once| LB1
M3[1M users] -->|All at once| LB1
LB1 --> S1[Server 1]
LB1 --> S2[Server 2]
LB1 --> S3[Server 3]
S1 --> DB1[(Database
OVERLOADED)]
S2 --> DB1
S3 --> DB1
style S1 fill:#FFB6B6,stroke:#333,stroke-width:2px
style S2 fill:#FFB6B6,stroke:#333,stroke-width:2px
style S3 fill:#FFB6B6,stroke:#333,stroke-width:2px
style DB1 fill:#FFB6B6,stroke:#333,stroke-width:2px
style LB1 fill:#FFB6B6,stroke:#333,stroke-width:2px
What happens:
- 3 million requests hit 100 servers
- Each server opens 10 database connections
- Database gets 1,000 connections (it can handle 500)
- Database crashes
- Entire system goes down
- Everyone loses
The Queue System: Controlled Entry
The solution isn’t more capacity. It’s controlled entry. Think of it like a nightclub with a velvet rope.
flowchart TB
Users[14M Users] --> Entry[Entry Point]
Entry --> Queue[Virtual Queue
Position assigned]
Queue --> Wait[Estimated wait
23 minutes]
Wait --> Browse[Keep browsing]
Queue --> RateLimit[Rate Limiter
10K/min]
RateLimit --> Turn[Your turn!
10 min window]
RateLimit --> App[Application]
App --> DB[(Database)]
style Queue fill:#fff3cd,stroke:#333,stroke-width:2px
style RateLimit fill:#ffc107,stroke:#333,stroke-width:2px
style DB fill:#90EE90,stroke:#333,stroke-width:2px
How it works:
- 10 AM: Tickets go on sale
- 14M users: All connect at once
- Virtual queue: Assigns everyone a position
- Rate limiting: Let 10,000 users through per minute
- Controlled load: System operates smoothly
- Fair access: First come, first served (mostly)
Key metrics:
- Queue capacity: Unlimited (Redis can handle billions of entries)
- Throughput: 10,000 users/minute (configurable)
- Wait time: Transparently shown to users
- Dropout handling: If someone leaves, next person advances
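A minimal version of this queue needs nothing more exotic than Redis counters. The sketch below is illustrative only (the key names and the `ADMIT_PER_MINUTE` knob are assumptions, not anyone's production code): `INCR` hands out positions atomically, and a scheduled job advances the admission cursor at a fixed rate.

```python
# Virtual queue sketch with Redis counters.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

ADMIT_PER_MINUTE = 10_000  # throughput knob from the text

def join_queue(event_id: str, session_id: str) -> int:
    """Give this session a queue position (1-based); INCR makes assignment atomic."""
    position = r.incr(f"queue:{event_id}:tail")
    r.hset(f"queue:{event_id}:positions", session_id, position)
    return position

def my_status(event_id: str, session_id: str) -> dict:
    position = int(r.hget(f"queue:{event_id}:positions", session_id) or 0)
    admitted_up_to = int(r.get(f"queue:{event_id}:admitted") or 0)
    people_ahead = max(position - admitted_up_to, 0)
    return {
        "admitted": position != 0 and position <= admitted_up_to,
        "people_ahead": people_ahead,
        "estimated_wait_minutes": people_ahead / ADMIT_PER_MINUTE,
    }

def admit_next_batch(event_id: str) -> None:
    """Run once a minute (cron/scheduler): lets the next 10,000 positions through."""
    r.incrby(f"queue:{event_id}:admitted", ADMIT_PER_MINUTE)
```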
The Multi-Layer Traffic Control
Real systems use multiple layers of protection:
flowchart TB
User[User] --> CDN[Layer 1: CDN]
CDN --> Queue[Layer 2: Virtual Queue]
Queue --> RL[Layer 3: Rate Limiting]
RL --> App1[Server 1]
RL --> App2[Server 2]
RL --> App3[Server 3]
App1 --> Primary[(Primary DB)]
App2 --> Primary
App3 --> Primary
App1 --> Replica1[(Read Replica 1)]
App2 --> Replica2[(Read Replica 2)]
App3 --> Replica1
style CDN fill:#87CEEB,stroke:#333,stroke-width:2px
style Queue fill:#fff3cd,stroke:#333,stroke-width:2px
style RL fill:#ffc107,stroke:#333,stroke-width:2px
style Primary fill:#90EE90,stroke:#333,stroke-width:2px
style Replica1 fill:#87CEEB,stroke:#333,stroke-width:2px
style Replica2 fill:#87CEEB,stroke:#333,stroke-width:2px
Why this works:
- CDN: Serves images, CSS, JavaScript. Takes 90% of the load off your servers.
- Queue: Prevents the stampede. System processes users at sustainable rate.
- Rate limiting: Stops one user from hogging resources with 1000 requests/second.
- Read replicas: Search queries go to replicas. Writes go to primary. Read/write separation.
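The rate-limiting layer can be as simple as a fixed-window counter per user in Redis. A rough sketch (the 10-requests-per-second budget is just an illustrative number):

```python
# Per-user rate limit sketch: fixed-window counter in Redis.
import time
import redis

r = redis.Redis(host="localhost", port=6379)

MAX_REQUESTS_PER_WINDOW = 10
WINDOW_SECONDS = 1

def allow_request(user_id: str) -> bool:
    window = int(time.time() // WINDOW_SECONDS)
    key = f"ratelimit:{user_id}:{window}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, WINDOW_SECONDS * 2)  # old windows clean themselves up
    return count <= MAX_REQUESTS_PER_WINDOW
```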
Challenge #4: Payment Processing Nightmares
You’ve fought through race conditions, acquired locks, survived the queue. Now comes the final boss: payment processing.
Here’s what makes it terrifying: payments can fail in 15 different ways.
The 15 ways payment can fail:
- Card declined (insufficient funds)
- Network timeout (no response from gateway)
- Gateway is down (service unavailable)
- Invalid card details
- Card expired
- Fraud detection triggered
- 3D Secure authentication required
- Payment processing (still pending)
- Bank server timeout
- Payment gateway rate limiting
- Duplicate transaction detected
- Currency conversion failure
- Payment amount mismatch
- Session expired
- User canceled payment
The worst part: You don’t know if the payment actually went through during a timeout. The money might have left the customer’s account, but you never got confirmation.
The Idempotency Pattern (Stolen from Stripe)
Remember how Stripe prevents double payments? Same principle here.
Every payment request gets a unique idempotency key (e.g., booking_R789456_attempt_1697123456). When a user retries after a timeout, the system checks if that key already exists. If the payment succeeded the first time, return the cached result instead of charging again.
Key implementation points for ticket booking:
- Generate idempotency key:
booking_{reservation_id}_attempt_{timestamp}
- Store payment result in cache (Redis) with the key
- On retry, check cache first before calling payment gateway
- If found, return cached result - no double charge
- Key expires after 24 hours (long enough for any legitimate retry)
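Here's a minimal sketch of that check-the-cache-first wrapper. `charge_card` is a placeholder for whatever gateway client you actually use:

```python
# Idempotent payment wrapper sketch: cache the result under the idempotency key
# so a retry after a timeout never charges the card twice.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

IDEMPOTENCY_TTL = 24 * 60 * 60  # 24 hours, per the list above

def charge_card(amount: int, card_token: str) -> dict:
    """Placeholder for the real payment-gateway call."""
    raise NotImplementedError

def pay_once(idempotency_key: str, amount: int, card_token: str) -> dict:
    cached = r.get(f"payment:{idempotency_key}")
    if cached:
        return json.loads(cached)  # retry path: return the stored result, no second charge

    # Gateways such as Stripe also accept the idempotency key on the request itself,
    # which covers the case where the first response was lost before we could cache it.
    result = charge_card(amount, card_token)
    r.set(f"payment:{idempotency_key}", json.dumps(result), ex=IDEMPOTENCY_TTL)
    return result
```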
The Reconciliation Safety Net
Even with idempotency, things can go wrong. That’s why you need reconciliation.
flowchart TB
Find[Find PENDING
payments] --> Query[Query Payment
Gateway]
Query --> Compare{Gateway
Status?}
Compare -->|Paid| ConfirmDB[Mark CONFIRMED]
Compare -->|Failed| FailDB[Mark FAILED]
Compare -->|Unknown| Alert[Alert Ops Team]
style ConfirmDB fill:#90EE90,stroke:#333,stroke-width:2px
style FailDB fill:#FFB6B6,stroke:#333,stroke-width:2px
style Alert fill:#fff3cd,stroke:#333,stroke-width:2px
What reconciliation catches:
- Payments that succeeded but confirmation was lost
- Zombie bookings (PENDING forever)
- Gateway discrepancies
- Double charges (rare but possible)
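A sketch of that sweep, assuming a hypothetical `payments` table with a `status` column and a stand-in `gateway_lookup` helper:

```python
# Reconciliation sweep sketch: settle payments stuck in PENDING by asking the gateway.
from datetime import datetime, timedelta, timezone
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://localhost/tickets")  # placeholder connection string

def gateway_lookup(payment_id: str) -> str:
    """Hypothetical gateway query: returns 'PAID', 'FAILED', or 'UNKNOWN'."""
    raise NotImplementedError

def alert_ops(payment_id: str, reason: str) -> None:
    print(f"[RECONCILIATION ALERT] {payment_id}: {reason}")  # stand-in for paging/Slack

def reconcile_pending_payments(older_than_minutes: int = 15) -> None:
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=older_than_minutes)
    with engine.connect() as conn:
        stuck = conn.execute(
            text("SELECT id FROM payments WHERE status = 'PENDING' AND created_at < :cutoff"),
            {"cutoff": cutoff},
        ).fetchall()

    for (payment_id,) in stuck:
        status = gateway_lookup(payment_id)
        if status == "UNKNOWN":
            alert_ops(payment_id, "gateway has no definitive status")
            continue
        new_status = "CONFIRMED" if status == "PAID" else "FAILED"
        with engine.begin() as conn:
            conn.execute(text("UPDATE payments SET status = :s WHERE id = :id"),
                         {"s": new_status, "id": payment_id})
```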
Real-World Battle Stories
Let’s learn from the titans who fought these battles before us.
BookMyShow: From Zero to 50 Million Users
BookMyShow started in 2007 with a simple PHP monolith. By 2016, they were processing 1 million tickets per day across India.
Their original architecture:
- Single MySQL database
- Simple SELECT and UPDATE queries
- No locking mechanism
- Frequent double bookings
The breaking point: IPL cricket final, 2015. 500,000 people tried to book tickets simultaneously. The system collapsed. Double bookings everywhere. Customer support was flooded.
What they did:
- Sharded the database by city: Mumbai shows in one shard, Delhi in another. No cross-shard queries needed.
- Implemented Redis-based distributed locking: Sub-millisecond lock acquisition, automatic expiry.
- Added virtual queue for high-demand shows: Fair access, controlled load.
- Separated read and write databases: Search queries hit replicas, bookings hit primary.
- Pre-scaled for known events: IPL matches, blockbuster releases - they doubled capacity 2 hours before.
Results:
- 99.9% reduction in double bookings
- Handled 50M+ monthly users
- Survived every major event since
Ticketmaster: The Taylor Swift Wake-Up Call
November 15, 2022. The day the Eras Tour presale melted down - and the reason Ticketmaster's executives later had to answer to Congress.
What went wrong:
- Underestimated demand: Expected 1.5 million concurrent users, got 14 million
- No effective queue: Everyone hit the system simultaneously
- Database overload: Connection pools exhausted in minutes
- Cascading failures: Frontend crashed, then backend, then database
- Bot detection failure: Bots consumed 40% of capacity
The congressional testimony revealed:
- 3.5 billion system requests in just a few hours
- Only 2 million tickets available
- System designed for 1/10th of the actual load
- No circuit breakers or graceful degradation
What they’ve implemented since:
- Mandatory virtual queue: No one bypasses it, not even verified fans
- Dynamic rate limiting: Aggressive bot detection and blocking
- Capacity-based ticket release: Release tickets in waves based on system capacity
- Better scaling: Auto-scaling groups that spin up 10x capacity in minutes
- Chaos engineering: Regular stress tests with 10M simulated users
The lesson: Hope is not a strategy. Test at 10x your expected load.
The Caching Strategy: Speed Without Chaos
Caching is critical, but cache the wrong thing and you’re selling the same seat twice.
What to cache (and for how long):
flowchart LR
Movies[Movie Details] --> Static[Cache: 24 hours]
Venues[Theater Info] --> Static
Shows[Show Timings] --> SemiStatic[Cache: 1 hour]
Pricing[Dynamic Pricing] --> Dynamic[Cache: 10 seconds]
Locks[Seat Locks] --> NeverCache[Never Cache]
style Static fill:#90EE90,stroke:#333,stroke-width:2px
style SemiStatic fill:#fff3cd,stroke:#333,stroke-width:2px
style Dynamic fill:#FFE4B5,stroke:#333,stroke-width:2px
style NeverCache fill:#FFB6B6,stroke:#333,stroke-width:2px
The golden rules:
- Cache aggressively: Movie posters, theater layouts, seat maps
- Cache counts, not specifics: “50 seats available” (cached), not “A7 is available” (never cache)
- Cache search results: “Action movies in Mumbai” - cache for 5 minutes
- Never cache locks: Always hit Redis/DB for lock status
- Invalidate proactively: When show is added/canceled, invalidate immediately
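To make the rules concrete, here's a rough read-path sketch: aggregate seat counts get a tiny TTL, while lock status always goes straight to the lock store (the table and key names are assumptions):

```python
# Cache TTL sketch: aggregate counts get a short TTL; seat locks are never cached.
import redis
from sqlalchemy import create_engine, text

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
engine = create_engine("postgresql://localhost/tickets")  # placeholder connection string

def available_seat_count(show_id: int) -> int:
    """Cheap aggregate, safe to cache briefly: '50 seats available'."""
    cached = r.get(f"seatcount:{show_id}")
    if cached is not None:
        return int(cached)
    with engine.connect() as conn:
        count = conn.execute(
            text("SELECT COUNT(*) FROM seats WHERE show_id = :id AND status = 'AVAILABLE'"),
            {"id": show_id},
        ).scalar_one()
    r.set(f"seatcount:{show_id}", count, ex=10)  # a few seconds of staleness is acceptable
    return count

def is_seat_locked(show_id: int, seat: str) -> bool:
    """Lock status is never cached: always ask Redis (the lock store) directly."""
    return r.exists(f"seat:{show_id}:{seat}") == 1
```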
Cache warming for hot events:
%%{init: {'theme':'base', 'themeVariables': { 'actorBkg':'#E8F5E9','actorBorder':'#333','actorTextColor':'#333','fontSize':'16px'}}}%%
sequenceDiagram
participant A as Admin
participant S as System
participant C as Cache
participant D as DB
A->>S: Mark high demand
S->>D: Load show data
S->>D: Load seat layout
S->>D: Load pricing
S->>C: Pre-warm cache
Note over C: Data ready
S->>S: Scale up 5x
Note over A,C: 2 hrs before
Note over S: Go live
Note over C: Cache hit
Note over S: Fast
Key Takeaways
1. Prevent race conditions - Use distributed locks (Redis) for seat reservations. Test with concurrent requests.
2. Always use timeouts - Seat locks (5-10 min), payments (30 sec), queries (5 sec). No exceptions.
3. Control traffic - Virtual queues prevent stampedes. Rate limiting protects your system during flash sales.
4. Handle payment failures - Use idempotency keys to prevent double charges. Run reconciliation jobs for edge cases.
5. Cache wisely - Cache movie listings and theater layouts. Never cache seat availability or locks.
6. Test at scale - If you expect 100K users, test with 1M. Chaos engineering catches what you miss.
The Bottom Line
Ticket booking systems are hard because they demand correctness (never double-book), speed (sub-3-second response), and massive scale (10x expected load).
The difference between BookMyShow’s success and Ticketmaster’s Taylor Swift disaster? Distributed locks, virtual queues, idempotency keys, and relentless testing.
You now know what they learned the hard way. Build carefully, test ruthlessly.
Want to dive deeper? Check out the BookMyShow engineering blog for war stories from the trenches, and explore distributed systems patterns to level up your architecture game.