Software Engineering Glossary

Sharding

Also known as: Horizontal Partitioning Data Partitioning

Sharding is splitting one logical database across many machines by row, so each shard holds a subset of the data. A shard key decides which shard a row lives on. It lets a dataset and its write load grow past what a single server can hold, but it gives up easy cross-shard joins and single-node transactions, and a bad shard key creates hot spots that defeat the whole point.

Key Takeaways

  • Sharding is horizontal scaling for data: split rows across nodes instead of buying a bigger server.
  • The shard key is the most important design choice. A skewed key creates hot shards; a good key spreads load evenly.
  • Range sharding keeps neighbours together but risks hot ranges; hash sharding spreads load but loses range scans. Consistent hashing limits reshuffling when nodes change.
  • Cross-shard joins and transactions are hard, which is why a clean shard key that keeps related data together matters so much.

How It Works

  1. Pick a shard key and a strategy: by range, by hash of the key, or by an explicit directory.
  2. A router maps each row’s shard key to the shard that owns it.
  3. Reads and writes for a key go straight to its shard; queries that span shards must scatter and gather.
  4. Rebalancing moves data between shards as nodes are added or removed, ideally moving as little as possible.

Where It Is Used

  • MongoDB shards collections by a shard key; DynamoDB partitions by the partition key under the hood.
  • Vitess shards MySQL behind a routing layer, the approach YouTube and many others use.
  • Google Spanner splits data into splits across servers for global, horizontally scaled SQL.

Related glossary terms

Advertisement