Database Sharding: Strategies and Trade-offs


What is Sharding?

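Sharding splits a database horizontally across multiple servers. Each shard holds a subset of the rows, so storage and write throughput can grow with the number of shards. Scaling is rarely perfectly linear in practice, because cross-shard queries and unevenly distributed keys add overhead.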

Key-Based Sharding

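Hash the shard key and take the result modulo the number of shards to pick the target shard:

    import hashlib

    class KeyBasedShardManager:
        def __init__(self, num_shards=4):
            self.num_shards = num_shards
            # Shard is assumed to be defined elsewhere, e.g. a connection wrapper.
            self.shards = [Shard(i) for i in range(num_shards)]

        def get_shard(self, shard_key):
            # SHA-256 is stable across processes, unlike Python's built-in hash() for strings.
            hash_val = int(hashlib.sha256(str(shard_key).encode()).hexdigest(), 16)
            shard_id = hash_val % self.num_shards
            return self.shards[shard_id]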
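
To check how evenly a hash-then-modulo scheme spreads keys, a minimal self-contained sketch might look like the following. It does not use the Shard class; the names `shard_for` and `distribution` and the `user-N` key format are hypothetical and chosen only for illustration.

```python
import hashlib
from collections import Counter


def shard_for(shard_key, num_shards=4):
    """Map a key to a shard index using SHA-256 followed by modulo."""
    digest = hashlib.sha256(str(shard_key).encode()).hexdigest()
    return int(digest, 16) % num_shards


def distribution(keys, num_shards=4):
    """Count how many of the given keys land on each shard index."""
    return Counter(shard_for(k, num_shards) for k in keys)


if __name__ == "__main__":
    # Hypothetical keys; real shard keys might be user or tenant IDs.
    counts = distribution(f"user-{i}" for i in range(10_000))
    for shard_id in sorted(counts):
        print(shard_id, counts[shard_id])
```

Sequential keys end up spread roughly evenly because the hash scrambles them. The trade-off is that a range scan over adjacent keys has to touch every shard.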





Range-Based Sharding

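Partition by value ranges. The PostgreSQL example below partitions a table by month on a single server; the same range boundaries apply when each range lives on its own shard:

    CREATE TABLE orders (
        id BIGSERIAL,
        order_date DATE,
        total DECIMAL(10,2),
        -- The primary key must include the partition key.
        PRIMARY KEY (id, order_date)
    ) PARTITION BY RANGE (order_date);

    -- The lower bound is inclusive and the upper bound is exclusive.
    CREATE TABLE orders_2026_01 PARTITION OF orders
        FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');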
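
Monthly partitions are typically created ahead of time. As a sketch of how that might be scripted, the hypothetical helper `monthly_partition_ddl` below builds the statement for a given month. It only produces a string, assumes the parent table is named `orders` as in the example, and does not connect to a database.

```python
from datetime import date


def monthly_partition_ddl(year, month, parent="orders"):
    """Build the CREATE TABLE statement for one monthly partition.

    `parent` is interpolated directly into the SQL, so it must come from
    trusted code, never from user input.
    """
    start = date(year, month, 1)
    # First day of the following month, rolling over into the next year.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"{parent}_{start:%Y_%m}"
    return (
        f"CREATE TABLE {name} PARTITION OF {parent}\n"
        f"    FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )


if __name__ == "__main__":
    print(monthly_partition_ddl(2026, 1))
    print(monthly_partition_ddl(2026, 12))
```

Because the upper bound is exclusive, each partition's end date is the next partition's start date, so consecutive months neither overlap nor leave gaps.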





Directory-Based Sharding

Use a lookup table for shard mapping:

    class DirectoryShardManager:
        def __init__(self):
            # Maps each shard key to a shard ID.
            self.directory = {}

        def map_key_to_shard(self, shard_key, shard_id):
            self.directory[shard_key] = shard_id

        def get_shard(self, shard_key):
            # Returns None when the key has no mapping yet.
            return self.directory.get(shard_key)
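
The flexibility of a directory comes from being able to reassign a single key without touching any other. The sketch below illustrates that with a hypothetical `TenantDirectory` class. The default-shard fallback and the tenant names are assumptions made for the example, and a real directory would live in a durable, highly available store, not in process memory.

```python
class TenantDirectory:
    """In-memory shard directory (illustrative only)."""

    def __init__(self, default_shard=0):
        self.default_shard = default_shard
        self.directory = {}

    def assign(self, shard_key, shard_id):
        """Create or overwrite the mapping for one key."""
        self.directory[shard_key] = shard_id

    def lookup(self, shard_key):
        # Unmapped keys fall back to the default shard (an assumption here).
        return self.directory.get(shard_key, self.default_shard)


if __name__ == "__main__":
    d = TenantDirectory()
    d.assign("tenant-a", 1)
    d.assign("tenant-b", 1)
    # Move one hot tenant to its own shard; no other key is affected.
    d.assign("tenant-a", 2)
    print(d.lookup("tenant-a"), d.lookup("tenant-b"), d.lookup("tenant-c"))
```

Updating the mapping only redirects lookups. The tenant's rows still have to be copied to the new shard before the directory entry is switched.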





Rebalancing

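When shards are added or removed, data must be redistributed. With plain modulo hashing, changing the shard count remaps most keys. Consistent hashing limits the movement to roughly the share of keys owned by the shard being added or removed. Tools such as Vitess and Citus automate much of the rebalancing process.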
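
As an illustration of consistent hashing, here is a minimal sketch of a hash ring with virtual nodes. The `HashRing` class, the shard names, and the choice of 100 virtual nodes per shard are assumptions made for this example. It does not reflect how Vitess or Citus implement rebalancing.

```python
import bisect
import hashlib


def _hash(value):
    """Stable 64-bit hash derived from SHA-256."""
    return int(hashlib.sha256(value.encode()).hexdigest()[:16], 16)


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, shard_names, vnodes=100):
        self.vnodes = vnodes
        self._points = []  # sorted hash positions on the ring
        self._owners = {}  # hash position -> shard name
        for name in shard_names:
            self.add_shard(name)

    def add_shard(self, name):
        """Place `vnodes` positions for this shard on the ring."""
        for i in range(self.vnodes):
            point = _hash(f"{name}#{i}")
            bisect.insort(self._points, point)
            self._owners[point] = name

    def get_shard(self, shard_key):
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(str(shard_key)))
        return self._owners[self._points[idx % len(self._points)]]


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(10_000)]
    ring = HashRing(["shard-0", "shard-1", "shard-2", "shard-3"])
    before = {k: ring.get_shard(k) for k in keys}
    ring.add_shard("shard-4")
    moved = sum(1 for k in keys if ring.get_shard(k) != before[k])
    print(f"keys moved after adding a shard: {moved / len(keys):.1%}")
```

Going from four shards to five, the ring is expected to move about one fifth of the keys, and every moved key lands on the new shard. Under modulo hashing the same change would remap about four fifths of the keys, since a key stays put only when its hash gives the same remainder for both 4 and 5.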

Conclusion

Choose key-based sharding for even distribution, range-based sharding for time-series data, and directory-based sharding for maximum flexibility. Whichever strategy you pick, choose a shard key that spreads both data and traffic evenly, plan for rebalancing from the start, and avoid cross-shard queries where possible.