Database sharding is the practice of horizontally partitioning a database across multiple servers. Each shard holds a subset of the data, and together they form the complete dataset. Sharding is essential for applications that outgrow a single database instance. This article covers sharding strategies, rebalancing challenges, and available tools.
Why Shard?
A single database instance has limits. CPU, memory, disk I/O, and network bandwidth all cap out eventually. Before resorting to sharding, consider these alternatives:
When these options are exhausted, sharding becomes necessary. Sharding distributes both read and write load across multiple nodes.
Sharding Strategies
Key-Based Sharding (Hash Sharding)
Key-based sharding uses a hash function on the shard key to determine which shard stores a record.
shard_id = hash(shard_key) % number_of_shards
def get_shard(user_id, num_shards=8):
return hash(user_id) % num_shards
**Advantages**:
**Disadvantages**:
Range-Based Sharding
Range-based sharding divides data by ranges of the shard key.
Shard 1: user_id 0001-1000
Shard 2: user_id 1001-2000
Shard 3: user_id 2001-3000
**Advantages**:
**Disadvantages**:
Directory-Based Sharding
Directory-based sharding uses a lookup table that maps keys to shards.
Lookup table:
user_1 -> shard_3
user_2 -> shard_1
user_3 -> shard_3
# Directory-based routing
class ShardDirectory:
def __init__(self):
self.mapping = {} # key -> shard_id
def get_shard(self, key):
return self.mapping.get(key)
def assign_shard(self, key, shard_id):
self.mapping[key] = shard_id
**Advantages**:
**Disadvantages**:
Choosing a Shard Key
The shard key is the most important decision in sharding. A good shard key:
-- Bad shard key example: monotonically increasing ID
CREATE TABLE orders (
order_id BIGSERIAL PRIMARY KEY, -- monotonically increasing
user_id INT,
created_at TIMESTAMP,
-- ...
);
-- Good shard key for order data
-- Use a hash of user_id or customer_id so orders stay with the customer
Resharding Challenges
Adding or removing shards is one of the hardest operations in database administration.
The Resharding Problem
With hash-based sharding using `hash(key) % N`, changing N from 8 to 9 means almost every key maps to a different shard. Moving terabytes of live data without downtime is complex.
Resharding Approaches
2. **Virtual shards (consistent hashing)**: Use many virtual shards (e.g., 4096) mapped to fewer physical shards. Adding a physical shard involves reassigning some virtual shards, not all data.
3. **Pre-splitting**: Start with more shards than you need. As data grows, redistribute shards. This delays resharding but does not eliminate it.
4. **Read repair / write behind**: During resharding, route reads to the old shard and writes to both. A background process moves data and reconciles differences.
Sharding Tools and Platforms
MySQL Fabric
MySQL Fabric provides sharding management for MySQL. It handles shard creation, data distribution, and routing. It is an Oracle-supported solution but has limited adoption compared to alternatives.
Vitess
Vitess is a database clustering system for horizontal scaling of MySQL. It powers YouTube's MySQL deployment and is now a CNCF project.
Key features:
-- Vitess schema with sharding key annotation
CREATE TABLE users (
user_id BIGINT,
name VARCHAR(255),
email VARCHAR(255),
PRIMARY KEY (user_id)
) VITESS_SHARD_BY HASH(user_id);
Citus
Citus is a PostgreSQL extension that transforms PostgreSQL into a distributed database. It distributes tables across a cluster while maintaining PostgreSQL compatibility.
-- Create a distributed table in Citus
SELECT create_distributed_table('orders', 'customer_id');
-- Citus automatically routes queries to the correct shard
SELECT COUNT(*) FROM orders WHERE customer_id = 42;
-- Runs on one shard (fast)
SELECT COUNT(*) FROM orders WHERE created_at > now() - interval '1 day';
-- Runs on all shards (coordinator aggregates)
MongoDB Sharding
MongoDB has built-in sharding with a config server replica set and mongos routers.
// Enable sharding on a database
sh.enableSharding("my_database")
// Shard a collection with range-based hashed key
sh.shardCollection("my_database.orders", { "order_id": "hashed" })
Query Patterns with Sharding
Sharding works well when queries target a single shard. Cross-shard queries are expensive.
When Not to Shard
Sharding adds complexity. Do not shard unless you need to:
For many applications, a well-tuned single database with replicas and caching is sufficient.
Conclusion
Sharding is a powerful but complex scaling technique. Choose hash-based sharding for even distribution, range-based for ordered data with predictable access patterns, or directory-based for maximum flexibility. Design your shard key carefully, plan for resharding from the start, and choose tools like Vitess or Citus that handle the complexity. Most importantly, exhaust simpler scaling options before adopting sharding.