You send a message in Slack, and within milliseconds, it appears on your colleague’s screen across the globe. Behind this simple action lies one of the most sophisticated real-time messaging systems ever built.
Slack processes over 10 billion messages annually, serves 20+ million daily active users, and maintains 99.99% uptime. It started as an internal tool at a game company whose game failed, and it ended up tackling one of the hardest problems in software: real-time communication at scale.
The core challenges they solved:
- Instant delivery to the right people
- Perfect ordering even when networks fail
- Seamless experience across all devices
- Search through years of history
- Zero downtime for millions of users
Let’s see how they built a system that never sleeps.
Slack’s Architecture: The 30,000-Foot View
Before diving into the nitty-gritty, here’s how Slack’s system looks from above:
graph TB
subgraph "Client Layer"
Mobile[📱 Mobile Apps]
Desktop[💻 Desktop Apps]
Web[🌐 Web Apps]
end
subgraph "Load Balancing"
LB[⚖️ Load Balancer]
end
subgraph "Application Layer"
WebApp[🖥️ WebApp Servers]
RTM[⚡ Real-time Servers]
end
subgraph "Data Layer"
MySQL[(🗃️ MySQL Shards)]
Redis[(💾 Redis Cache)]
Search[(🔍 Search Service)]
FileStore[(📁 File Storage)]
end
subgraph "External"
Queue[📥 Job Queues]
CDN[☁️ CDN]
end
Mobile --> LB
Desktop --> LB
Web --> LB
LB --> WebApp
LB --> RTM
WebApp --> MySQL
WebApp --> Redis
WebApp --> Queue
RTM --> MySQL
RTM --> Redis
Search --> MySQL
CDN --> FileStore
The Two-Brain Architecture
Slack’s key insight: separate the brain that thinks from the brain that talks.
WebApp Servers (The Thinkers)
- User authentication and authorization
- Message validation and processing
- Database writes and complex queries
- File uploads and integrations
Real-Time Messaging Servers (The Talkers)
- WebSocket connection management
- Message broadcasting
- Presence updates and typing indicators
- Real-time notifications
Why split them? The WebApp servers focus on being thorough. RTM servers focus on speed. Each scales independently.
Here’s how a message flows through both systems:
sequenceDiagram
participant Client1 as 👤 User A
participant WebApp as 🖥️ WebApp Server
participant DB as 🗃️ Database
participant RTM as ⚡ RTM Server
participant Client2 as 👤 User B
Client1->>WebApp: Send message
WebApp->>WebApp: Validate & process
WebApp->>DB: Store message
WebApp->>RTM: Notify: new message
RTM->>Client2: Broadcast message
WebApp->>Client1: Confirm delivery
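The flow in the diagram can be sketched in a few lines of Python. This is a minimal in-process illustration, not Slack's code: `WebAppServer`, `RTMServer`, the in-memory list standing in for the database, and the per-user outbox standing in for a WebSocket are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RTMServer:
    """The 'talker': tracks connected users and pushes messages to them."""
    # Hypothetical stand-in for open WebSocket connections, keyed by user ID.
    connections: dict[str, list[dict]] = field(default_factory=dict)

    def connect(self, user_id: str) -> None:
        self.connections.setdefault(user_id, [])

    def broadcast(self, message: dict, recipients: list[str]) -> None:
        for user_id in recipients:
            if user_id in self.connections:  # only connected users get a push
                self.connections[user_id].append(message)


@dataclass
class WebAppServer:
    """The 'thinker': validates, persists, then hands off to the RTM server."""
    rtm: RTMServer
    db: list[dict] = field(default_factory=list)  # stand-in for a MySQL shard

    def send_message(self, sender: str, channel_members: list[str], text: str) -> dict:
        if not text.strip():
            raise ValueError("empty message")
        message = {"id": len(self.db) + 1, "sender": sender, "text": text}
        self.db.append(message)  # store first, so nothing is broadcast unsaved
        recipients = [m for m in channel_members if m != sender]
        self.rtm.broadcast(message, recipients)
        return {"ok": True, "id": message["id"]}  # confirmation for the sender


rtm = RTMServer()
rtm.connect("user_b")
webapp = WebAppServer(rtm=rtm)
ack = webapp.send_message("user_a", ["user_a", "user_b"], "hello")
```

The ordering mirrors the diagram: the message is stored before it is broadcast, and the sender gets a confirmation only after both steps have run.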
The Workspace Sharding Strategy
Slack’s key insight: workspaces are natural boundaries. Users in different companies rarely interact, and each workspace has its own isolated data.
So Slack shards everything by workspace:
graph LR
subgraph "Shard 3"
WS3[Workspace C]
DB3[(Database 3)]
RTM3[RTM Server 3]
Search3[Search Index 3]
end
subgraph "Shard 2"
WS2[Workspace B]
DB2[(Database 2)]
RTM2[RTM Server 2]
Search2[Search Index 2]
end
subgraph "Shard 1"
WS1[Workspace A]
DB1[(Database 1)]
RTM1[RTM Server 1]
Search1[Search Index 1]
end
WS1 --> DB1
WS1 --> RTM1
WS1 --> Search1
WS2 --> DB2
WS2 --> RTM2
WS2 --> Search2
WS3 --> DB3
WS3 --> RTM3
WS3 --> Search3
Benefits: Linear scaling, fault isolation, data locality, and simpler queries.
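One simple way to route a workspace to its shard is a stable hash of the workspace ID. The sketch below assumes that approach purely for illustration; the article does not say how Slack assigns workspaces to shards, and `shard_for_workspace`, `connection_name`, and the `database_N` naming are hypothetical.

```python
import hashlib

NUM_SHARDS = 3  # illustrative; matches the three shards in the diagram


def shard_for_workspace(workspace_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a workspace ID to a shard index using a stable hash.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings and would route inconsistently across servers.
    """
    digest = hashlib.sha256(workspace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def connection_name(workspace_id: str) -> str:
    """Return the (hypothetical) name of the database holding this workspace."""
    return f"database_{shard_for_workspace(workspace_id) + 1}"


if __name__ == "__main__":
    for ws in ("workspace_a", "workspace_b", "workspace_c"):
        print(ws, "->", connection_name(ws))
```

A plain modulo makes it hard to move one workspace without reshuffling others. An explicit lookup table from workspace to shard is a common alternative when individual workspaces need to be relocated.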
The Thundering Herd Problem
Imagine 50,000 employees losing internet, then reconnecting simultaneously. Each client needs authentication, workspace snapshots, WebSocket connections, and message sync.
This Thundering Herd Problem nearly broke Slack’s early architecture.
Here’s how they solved it:
1. Exponential Backoff with Jitter
Instead of everyone reconnecting immediately:
reconnect_delay = base_delay * (2 ** attempt_number) + random_jitter
This spreads reconnections over time instead of clustering them.
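As a runnable sketch of that formula in Python: the one-second base, the 60-second cap, and the choice to draw jitter uniformly from zero to `base_delay` are assumptions for illustration, not values from Slack.

```python
import random
from typing import Optional


def reconnect_delay(
    attempt: int,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return seconds to wait before reconnect attempt number `attempt` (0-based).

    The exponential part doubles on every attempt and is capped at max_delay
    (the cap is an assumption; the formula in the text has none). The jitter
    is a random extra in [0, base_delay] so clients do not retry in lockstep.
    """
    rng = rng or random.Random()
    backoff = min(max_delay, base_delay * (2 ** attempt))
    jitter = rng.uniform(0, base_delay)
    return backoff + jitter


if __name__ == "__main__":
    for attempt in range(6):
        print(f"attempt {attempt}: wait {reconnect_delay(attempt):.2f}s")
```

Without a cap, the delay grows without bound during a long outage, so most client libraries add one.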
2. Snapshot Optimization
They drastically reduced what gets sent during reconnection:
- Only essential channel metadata (not full history)
- Lazy-load message history as users scroll
- Compress large user lists
- Send presence updates in batches
3. Connection Pooling
RTM servers pool WebSocket connections for efficient reuse.
Database Design: MySQL at Scale
Slack chose MySQL over NoSQL and made it work through smart sharding.
Storage challenge: Recent messages need instant access, old messages need searchability, and everything must handle edits/deletes.
Solution: Tiered storage by age.
graph TD
subgraph "Hot Storage (Last 30 days)"
MySQL1[(MySQL Primary)]
Redis1[(Redis Cache)]
end
subgraph "Warm Storage (30-365 days)"
MySQL2[(MySQL Secondary)]
Redis2[(Limited Cache)]
end
subgraph "Cold Storage (>1 year)"
S3[(Amazon S3)]
Archive[(Compressed Archives)]
end
subgraph "Search Layer"
ES[(Elasticsearch)]
end
App[Application] --> Redis1
Redis1 --> MySQL1
MySQL1 --> MySQL2
MySQL2 --> S3
MySQL1 --> ES
MySQL2 --> ES
S3 --> ES
App --> ES
Key insight: No foreign keys between shards keeps each workspace completely independent.
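The age boundaries in the diagram translate into a small routing function. This is a sketch only: the 30-day and 365-day cutoffs come from the diagram, while `storage_tier` and the tier names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

HOT_WINDOW = timedelta(days=30)    # served from MySQL primary plus Redis
WARM_WINDOW = timedelta(days=365)  # served from MySQL secondary


def storage_tier(sent_at: datetime, now: Optional[datetime] = None) -> str:
    """Pick the storage tier for a message based on its age."""
    now = now or datetime.now(timezone.utc)
    age = now - sent_at
    if age <= HOT_WINDOW:
        return "hot"
    if age <= WARM_WINDOW:
        return "warm"
    return "cold"  # archived, for example to S3


fixed_now = datetime(2024, 6, 1, tzinfo=timezone.utc)
yesterday = storage_tier(fixed_now - timedelta(days=1), now=fixed_now)
last_quarter = storage_tier(fixed_now - timedelta(days=90), now=fixed_now)
two_years_ago = storage_tier(fixed_now - timedelta(days=730), now=fixed_now)
```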
Real-Time Magic: WebSockets Done Right
The real-time experience is where Slack shines. Here’s how they make it feel instant:
Connection State Management
Problem: WebSocket connections aren’t stateless. Each server needs to know your channels, presence, and read state.
Solution: Store connection state in Redis for seamless failover.
graph TB
subgraph "RTM Cluster"
RTM1[RTM Server 1]
RTM2[RTM Server 2]
RTM3[RTM Server 3]
end
subgraph "State Storage"
Redis[(Redis Cluster)]
end
subgraph "Clients"
User1[👤 User A]
User2[👤 User B]
User3[👤 User C]
end
User1 --> RTM1
User2 --> RTM2
User3 --> RTM3
RTM1 --> Redis
RTM2 --> Redis
RTM3 --> Redis
Redis --> RTM1
Redis --> RTM2
Redis --> RTM3
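To illustrate why externalized state makes failover cheap, here is a minimal sketch. `StateStore` is an in-memory stand-in for the Redis cluster so the example runs on its own; `RealTimeServer`, the `conn:<user_id>` key format, and the JSON layout are all hypothetical.

```python
import json
from typing import Optional


class StateStore:
    """In-memory stand-in for a Redis cluster (string get/set only)."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class RealTimeServer:
    def __init__(self, name: str, store: StateStore) -> None:
        self.name = name
        self.store = store

    def save_state(self, user_id: str, channels: list[str], last_read: dict[str, int]) -> None:
        """Persist connection state outside this process."""
        state = {"channels": channels, "last_read": last_read, "server": self.name}
        self.store.set(f"conn:{user_id}", json.dumps(state))

    def resume(self, user_id: str) -> Optional[dict]:
        """Take over a user's connection using state saved by any server."""
        raw = self.store.get(f"conn:{user_id}")
        if raw is None:
            return None  # unknown user: a full handshake would be needed
        state = json.loads(raw)
        state["server"] = self.name  # record the new owner
        self.store.set(f"conn:{user_id}", json.dumps(state))
        return state


store = StateStore()
rtm1 = RealTimeServer("rtm-1", store)
rtm2 = RealTimeServer("rtm-2", store)
rtm1.save_state("user_a", ["#general"], {"#general": 42})
state = rtm2.resume("user_a")  # rtm-1 went away; rtm-2 picks up the session
```

Because no server keeps the only copy of a user's state, any server in the cluster can accept the reconnect.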
Message Delivery Guarantees
Problem: Networks are unreliable. Messages can get lost, duplicated, or arrive out of order.
Solution: At-least-once delivery with client-side deduplication:
- Server assigns unique ID to each message
- Client acknowledges receipt
- Server retries if no acknowledgment
- Client deduplicates using message ID
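The four steps above can be sketched as a small simulation. Everything here is illustrative: `Server`, `Client`, and the `ack_lost_on_attempts` parameter (which simulates acknowledgments dropped by the network) are hypothetical names, and a real system would retry on a timer rather than in a loop.

```python
import uuid


class Client:
    def __init__(self) -> None:
        self.seen_ids: set[str] = set()
        self.displayed: list[str] = []

    def receive(self, message: dict) -> str:
        """Show the message once and return its ID as the acknowledgment."""
        if message["id"] not in self.seen_ids:
            self.seen_ids.add(message["id"])
            self.displayed.append(message["text"])
        # Acknowledge duplicates too, so the server stops retrying.
        return message["id"]


class Server:
    def __init__(self, max_attempts: int = 5) -> None:
        self.max_attempts = max_attempts

    def deliver(self, client: Client, text: str, ack_lost_on_attempts=()) -> int:
        """Send until acknowledged; return the number of attempts used."""
        message = {"id": str(uuid.uuid4()), "text": text}  # unique message ID
        for attempt in range(1, self.max_attempts + 1):
            ack = client.receive(message)
            if attempt in ack_lost_on_attempts:
                continue  # the ack never arrived, so the server must retry
            if ack == message["id"]:
                return attempt
        raise TimeoutError("no acknowledgment received")


client = Client()
server = Server()
attempts = server.deliver(client, "deploy finished", ack_lost_on_attempts={1, 2})
```

The client received the message three times but displays it once, which is the point of pairing retries with deduplication.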
Search: Essential for Productivity
With billions of messages, great search transforms user productivity.
graph TB
subgraph "Search Pipeline"
Messages[(Messages DB)]
Indexer[Message Indexer]
ES[(Elasticsearch)]
SearchAPI[Search API]
end
subgraph "User Experience"
User[👤 User]
SearchBox[🔍 Search Interface]
Results[📋 Search Results]
end
Messages --> Indexer
Indexer --> ES
User --> SearchBox
SearchBox --> SearchAPI
SearchAPI --> ES
ES --> Results
Results --> User
Search optimizations:
- Real-time indexing (seconds)
- Smart autocomplete
- Contextual ranking
- Aggressive caching
Query examples: from:@john after:2023-01-01, in:#general has:link
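Before a query like these reaches the search backend, something has to separate the modifiers from the free text. The parser below is a hypothetical sketch of that step, not Slack's implementation; it recognizes only the four modifiers shown in the examples.

```python
KNOWN_MODIFIERS = {"from", "in", "has", "after"}


def parse_query(query: str) -> dict:
    """Split a search string into modifier filters and free-text terms."""
    filters: dict[str, list[str]] = {}
    terms: list[str] = []
    for token in query.split():
        key, sep, value = token.partition(":")
        if sep and value and key in KNOWN_MODIFIERS:
            filters.setdefault(key, []).append(value)
        else:
            terms.append(token)  # unknown prefixes are treated as plain text
    return {"filters": filters, "terms": terms}


parsed = parse_query("from:@john after:2023-01-01 budget")
# parsed == {"filters": {"from": ["@john"], "after": ["2023-01-01"]},
#            "terms": ["budget"]}
```

Filters narrow the candidate set cheaply (sender, channel, date), leaving full-text ranking to run only on what remains.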
File Sharing at Scale
People share everything from tiny text files to 1GB videos.
graph TB
subgraph "Upload Flow"
Client[📱 Client]
Upload[⬆️ Upload Service]
Virus[🛡️ Virus Scanner]
Process[⚙️ File Processor]
CDN[☁️ CDN]
end
subgraph "Storage"
S3[(🗃️ S3 Storage)]
Thumbs[(🖼️ Thumbnails)]
Preview[(👁️ Preview Cache)]
end
Client --> Upload
Upload --> Virus
Virus --> Process
Process --> S3
Process --> Thumbs
Process --> Preview
S3 --> CDN
Thumbs --> CDN
Preview --> CDN
Key challenges: Security scanning, performance optimization, compliance, and cost management.
Monitoring Philosophy
Key insight: Alert on user impact, not system metrics.
Instead of CPU alerts, Slack monitors:
- Message delivery time > 2 seconds
- Search latency > 500ms
- Connection failure rate > 1%
This keeps focus on actual user experience.
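Expressed as code, the three thresholds above become a short rule table. This is a sketch under assumptions: the metric names and the `breached` function are hypothetical, and the measurements are assumed to be already aggregated (the article does not say which percentile or window Slack uses).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    unit: str


# Limits taken from the list above; metric names are made up for the example.
THRESHOLDS = [
    Threshold("message_delivery_seconds", 2.0, "s"),
    Threshold("search_latency_ms", 500.0, "ms"),
    Threshold("connection_failure_rate", 0.01, ""),
]


def breached(measurements: dict[str, float]) -> list[str]:
    """Return one alert string per user-facing threshold that is exceeded."""
    alerts = []
    for t in THRESHOLDS:
        value = measurements.get(t.metric)
        if value is not None and value > t.limit:
            alerts.append(f"{t.metric}: {value}{t.unit} exceeds {t.limit}{t.unit}")
    return alerts


alerts = breached({
    "message_delivery_seconds": 2.5,
    "search_latency_ms": 120.0,
    "connection_failure_rate": 0.002,
    "cpu_percent": 97.0,  # ignored: not a user-facing metric
})
```

High CPU alone raises nothing here; only the slow message delivery does.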
Key Lessons Learned
1. Workspace Migration: Large workspaces need dedicated shards. Size distribution is extremely uneven.
2. Search Optimization: Tiered indexing - recent messages get full indexing, older messages get compressed indexing.
3. Mobile Battery: Intelligent connection management - close connections when backgrounded.
Technology Stack
Core: PHP (WebApp), Java (RTM), MySQL (sharded), Redis (caching), Elasticsearch (search)
Infrastructure: AWS, HAProxy, CloudFront CDN
Mobile: Swift (iOS), Kotlin (Android)
Key Takeaways
1. Sharding Strategy > Technology Choice: Workspace-based sharding defined Slack’s scalability ceiling, not MySQL vs MongoDB.
2. Separate Processing from Delivery: The WebApp/RTM split lets one service be thorough and the other be fast, and each scales on its own. Specialize your services.
3. Plan for Thundering Herds: Build exponential backoff into reconnection logic from day one.
4. Monitor User Experience: Instrument what users feel, not just CPU metrics.
5. Real-time is Expected: Users expect real-time everything in 2024.
6. Search is a Differentiator: Great search transforms productivity.
The Bottom Line
Slack’s success comes from smart trade-offs, not cool technology:
- Simplicity over complexity (MySQL over NoSQL)
- Specialization over generalization (separate WebApp/RTM)
- User experience over system metrics (alert on latency, not CPU)
The result? A system handling billions of messages that feels instant.