Distributed databases store data across multiple nodes while presenting a unified interface to applications. This article covers the fundamental concepts: consensus algorithms (Raft and Paxos), gossip protocols, conflict-free replicated data types (CRDTs), and the architectures behind two landmark distributed databases: Amazon Dynamo and Google Spanner.
Consensus Algorithms: Paxos and Raft
Consensus algorithms enable a group of nodes to agree on a value despite failures. They are the foundation of replicated state machines in distributed databases.
Paxos
Paxos was the first practical consensus algorithm, published by Leslie Lamport in 1998. It defines three roles:
1. **Proposers** propose values to be chosen.
2. **Acceptors** vote on proposals; a value is chosen once a majority of acceptors accept it.
3. **Learners** learn which value was chosen.
Paxos operates in two phases:
1. **Prepare phase**: The proposer picks a proposal number n and sends prepare(n) to a majority of acceptors. Each acceptor that has not already promised a higher number responds with a promise not to accept proposals numbered below n, along with the highest-numbered proposal it has already accepted (if any).
2. **Accept phase**: If the proposer receives promises from a majority, it sends an accept request with value V (the value from the highest-numbered previous proposal, or its own if none) to the acceptors. Acceptors accept the proposal unless they have since promised a higher number.
```python
# Simplified Paxos acceptor
class PaxosNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.promised_n = 0        # highest proposal number promised
        self.accepted_n = 0        # highest proposal number accepted
        self.accepted_value = None

    def handle_prepare(self, proposal_n, sender_id):
        # Phase 1: promise not to accept anything numbered below proposal_n
        if proposal_n > self.promised_n:
            self.promised_n = proposal_n
            return ("PROMISE", self.accepted_n, self.accepted_value)
        return ("REJECT",)

    def handle_accept(self, proposal_n, value):
        # Phase 2: accept unless a higher number has since been promised
        if proposal_n >= self.promised_n:
            self.promised_n = proposal_n
            self.accepted_n = proposal_n
            self.accepted_value = value
            return ("ACCEPTED",)
        return ("REJECT",)
```
Paxos is correct but notoriously difficult to understand and implement correctly. Most practical systems use variants or alternatives.
Raft
Raft is a consensus algorithm designed for understandability, published by Diego Ongaro and John Ousterhout in 2014. Raft divides consensus into three subproblems:
1. **Leader election**: A single leader is chosen for each term; only the leader accepts client requests.
2. **Log replication**: The leader appends client commands to its log and replicates the entries to followers (see the AppendEntries sketch after the election code below).
3. **Safety**: If any server has applied a log entry at a given index, no other server will apply a different entry at that index.
Raft uses randomized election timeouts to elect leaders: each follower resets a timer whenever it hears from the leader, and if the timer expires it becomes a candidate, increments its term, and starts an election.
```python
# Simplified Raft leader election
import random

class RaftNode:
    def __init__(self, node_id, peers=None):
        self.node_id = node_id
        self.peers = peers or []   # other RaftNode instances
        self.state = "FOLLOWER"
        self.current_term = 0
        self.voted_for = None
        self.election_timeout = random.uniform(150, 300)  # milliseconds

    def start_election(self):
        self.state = "CANDIDATE"
        self.current_term += 1
        self.voted_for = self.node_id
        votes = 1  # vote for self
        for peer in self.peers:
            response = peer.request_vote(self.current_term, self.node_id)
            if response["vote_granted"]:
                votes += 1
        # Majority of the full cluster (self plus peers)
        if votes > (len(self.peers) + 1) // 2:
            self.state = "LEADER"
            self.start_heartbeat()
```
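Log replication (subproblem 2) happens through AppendEntries RPCs sent by the leader; the same RPC doubles as the heartbeat. Below is a minimal follower-side sketch under the same simplified model, assuming `RaftNode` also carries a `log` (a list of `{"term", "command"}` dicts) and a `commit_index` field; these names are illustrative, not from the original code.

```python
# Sketch of the follower-side AppendEntries handler (not the full Raft rules)
def append_entries(self, term, prev_index, prev_term, entries, leader_commit):
    # Reject stale leaders
    if term < self.current_term:
        return {"term": self.current_term, "success": False}
    self.current_term = term
    self.state = "FOLLOWER"

    # Log consistency check: our log must contain prev_index with prev_term
    if prev_index >= len(self.log) or (prev_index >= 0 and self.log[prev_index]["term"] != prev_term):
        return {"term": self.current_term, "success": False}

    # Overwrite any conflicting suffix and append the new entries
    self.log = self.log[:prev_index + 1] + entries

    # Advance the commit index as far as the leader allows
    self.commit_index = min(leader_commit, len(self.log) - 1)
    return {"term": self.current_term, "success": True}

RaftNode.append_entries = append_entries  # attach to the class above (illustration only)
```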
Raft and Raft-derived implementations power etcd, Consul, TiKV, and MongoDB's replication protocol.
Paxos vs Raft
| Aspect | Paxos | Raft |
|--------|-------|------|
| Understandability | Hard | Accessible |
| Leader-based | Implicit leader | Explicit leader election |
| Performance | Similar | Similar |
| Practical implementations | Google Chubby, Spanner | etcd, Consul, TiKV |
Gossip Protocol
Gossip (epidemic) protocols spread information through a cluster by having each node periodically exchange state with a small set of random peers. Information spreads exponentially fast, reaching all N nodes in O(log N) rounds with high probability.
How Gossip Works
Round 1: Node A infects nodes B and C
Round 2: B infects D and E; C infects F and G
After two rounds all 7 nodes have the information; with each infected node contacting two peers per round, coverage roughly triples every round until the cluster is saturated.
```python
# Gossip protocol implementation
import random

class GossipNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers      # other GossipNode instances
        self.state = {}         # key -> value
        self.version = {}       # key -> version number

    def set(self, key, value):
        # Local write: bump the version so peers will adopt the new value
        self.version[key] = self.version.get(key, 0) + 1
        self.state[key] = value

    def get_state_to_share(self):
        # Package (value, version) pairs for transmission
        return {k: (self.state[k], self.version[k]) for k in self.state}

    def receive_state(self, their_state):
        # Merge what the sender knows, then reply with what we know
        self.merge(their_state)
        return self.get_state_to_share()

    def merge(self, their_state):
        # Keep the higher version for each key
        for key, (value, version) in their_state.items():
            if key not in self.version or version > self.version[key]:
                self.state[key] = value
                self.version[key] = version

    def gossip_round(self):
        # Select up to 3 random peers (the fanout) to gossip with
        targets = random.sample(self.peers, min(3, len(self.peers)))
        for target in targets:
            self.exchange_state(target)

    def exchange_state(self, target):
        # Send our state, receive theirs, and merge
        their_state = target.receive_state(self.get_state_to_share())
        self.merge(their_state)
```
Properties of Gossip
Gossip is fully decentralized (no coordinator and no single point of failure), places a small, roughly constant message load on each node regardless of cluster size, tolerates node failures and dropped messages, and converges in O(log N) rounds.
Uses in Distributed Databases
Cassandra and Riak use gossip for cluster membership, failure detection, and dissemination of metadata such as schema versions and ring/token ownership; Consul uses gossip (via Serf) for membership and failure detection.
Conflict-Free Replicated Data Types (CRDTs)
CRDTs are data structures that can be updated concurrently on different replicas and automatically merge without conflicts. They provide strong eventual consistency.
Types of CRDTs
**G-Counter (Grow-only Counter)**: A counter that only increases. Each replica owns one element in a vector. Increments only affect the replica's own element. Merge takes the element-wise maximum.
```python
class GCounter:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.counts = [0] * num_nodes   # one slot per replica

    def increment(self):
        # Only ever touch our own slot
        self.counts[self.node_id] += 1

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        # Element-wise maximum: commutative, associative, idempotent
        for i in range(len(self.counts)):
            self.counts[i] = max(self.counts[i], other.counts[i])
```
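A quick usage sketch (a hypothetical two-replica setup) shows why the merge is conflict-free: both replicas increment concurrently, and merging in either order converges to the same total.

```python
# Two replicas of the same counter (node ids 0 and 1)
a = GCounter(node_id=0, num_nodes=2)
b = GCounter(node_id=1, num_nodes=2)

a.increment()          # replica A adds 1
b.increment()          # replica B adds 2 concurrently
b.increment()

a.merge(b)             # merge in either direction, any number of times
b.merge(a)
assert a.value() == b.value() == 3
```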
**PN-Counter (Positive-Negative Counter)**: Supports both increments and decrements using two G-Counters.
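A minimal sketch reusing the GCounter above (the class name PNCounter is illustrative): one G-Counter tracks increments, another tracks decrements, and the value is their difference.

```python
class PNCounter:
    def __init__(self, node_id, num_nodes):
        self.inc = GCounter(node_id, num_nodes)   # all increments
        self.dec = GCounter(node_id, num_nodes)   # all decrements

    def increment(self):
        self.inc.increment()

    def decrement(self):
        self.dec.increment()

    def value(self):
        return self.inc.value() - self.dec.value()

    def merge(self, other):
        # Merging each half independently preserves the CRDT properties
        self.inc.merge(other.inc)
        self.dec.merge(other.dec)
```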
**G-Set (Grow-only Set)**: Elements can only be added, never removed. Merge is set union.
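The corresponding sketch is tiny (the class name GSet is illustrative): additions go into a plain set and merge is set union.

```python
class GSet:
    def __init__(self):
        self.items = set()

    def add(self, item):
        self.items.add(item)

    def contains(self, item):
        return item in self.items

    def merge(self, other):
        # Set union is commutative, associative, and idempotent
        self.items |= other.items
```

Supporting removals requires a different design, such as a 2P-Set or OR-Set that tracks removed elements separately.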
**LWW-Register (Last-Writer-Wins Register)**: Each write is tagged with a timestamp. On conflict, the write with the highest timestamp wins.
```python
class LWWRegister:
    def __init__(self):
        self.value = None
        self.timestamp = 0

    def set(self, value, timestamp):
        # Later writes win; ties keep the existing value
        if timestamp > self.timestamp:
            self.value = value
            self.timestamp = timestamp

    def merge(self, other):
        if other.timestamp > self.timestamp:
            self.value = other.value
            self.timestamp = other.timestamp
```
CRDTs in Production
Riak exposes CRDT-backed data types (counters, sets, maps), Redis Enterprise's Active-Active geo-replication is built on CRDTs, and Akka Distributed Data ships CRDT implementations for sharing state across a cluster.
Amazon Dynamo-Style Architectures
Dynamo-style databases (DynamoDB, Cassandra, Riak) prioritize availability and partition tolerance over strong consistency.
Key Design Decisions
1. **Consistent hashing**: Keys are placed on a hash ring and assigned to nodes; adding or removing a node moves only a small fraction of the keys (a minimal ring sketch appears after the write example below).
2. **N = 3 replication**: Each data item is replicated to N nodes (the key's preference list). The coordinator forwards writes to all replicas.
3. **Tunable consistency**: Each read or write operation specifies its consistency level.
```python
# Dynamo-style write with tunable consistency (pseudocode; the ring,
# coordinator, and replica objects are assumed to exist)
def dynamo_write(key, value, version, consistency_level="QUORUM"):
    # 1. Determine the coordinator using consistent hashing
    coordinator = consistent_hash_ring.get_node(key)

    # 2. Forward the write (with its version, e.g. a vector clock) to all N replicas
    replicas = coordinator.get_preference_list(key, N=3)
    responses = [replica.write(key, value, version) for replica in replicas]

    # 3. Succeed once enough replicas acknowledge, per the consistency level
    write_consistency = {
        "ONE": 1,
        "QUORUM": len(replicas) // 2 + 1,
        "ALL": len(replicas),
    }
    required = write_consistency[consistency_level]
    return sum(1 for r in responses if r["success"]) >= required
```
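The `consistent_hash_ring.get_node(key)` call above glosses over the ring itself. Here is a minimal sketch of a consistent hash ring; the class and method names are illustrative, not any particular library's API.

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=8):
        # Place several virtual points per node on the ring to even out load
        self.ring = sorted(
            (self._hash(f"{node}-{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def get_node(self, key):
        # Walk clockwise to the first point at or after hash(key)
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]
```

For example, `ConsistentHashRing(["node-a", "node-b", "node-c"]).get_node("user:42")` returns the node responsible for that key; a full Dynamo-style ring would continue clockwise to collect the next N distinct nodes as the preference list.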
Cassandra Example
In Cassandra the consistency level is not part of the CQL statement; clients set it per request through the driver, or in cqlsh with the CONSISTENCY command. Choosing levels so that read and write replica sets overlap (R + W > N, for example QUORUM reads and QUORUM writes with N = 3) yields strongly consistent reads.
```sql
-- cqlsh: set the consistency level for the statements that follow
CONSISTENCY QUORUM;
-- Write acknowledged by a quorum of replicas
INSERT INTO users (user_id, name, email)
VALUES ('u1', 'Alice', 'alice@example.com');

-- Read with eventual consistency (fastest)
CONSISTENCY ONE;
SELECT * FROM users WHERE user_id = 'u1';

-- Read with strong consistency (most coordination)
CONSISTENCY ALL;
SELECT * FROM users WHERE user_id = 'u1';
```
Google Spanner-Style Architectures
Spanner is Google's globally distributed database that provides external consistency (linearizability) across data centers.
Key Innovations
1. **TrueTime**: A clock API backed by GPS receivers and atomic clocks in each data center. Instead of a single timestamp, TT.now() returns an interval [earliest, latest] guaranteed to contain the true time.
2. **2PC + Paxos**: Each shard is replicated by its own Paxos group; transactions spanning multiple Paxos groups use two-phase commit across the groups, with TrueTime timestamps providing external consistency.
3. **Commit wait**: Spanner waits out the TrueTime uncertainty interval before acknowledging a write, so the commit timestamp is guaranteed to be in the past by the time the client sees it. This provides external consistency (linearizability) and lets snapshot reads proceed without locks, as sketched below.
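A minimal sketch of the commit-wait rule, assuming a hypothetical `truetime_now()` that returns the (earliest, latest) uncertainty interval and a hypothetical `apply_writes` helper:

```python
import time

def truetime_now():
    # Hypothetical TrueTime stand-in: the real API returns an interval
    # [earliest, latest] guaranteed to contain the true time.
    eps = 0.004  # assumed uncertainty bound, e.g. a few milliseconds
    now = time.time()
    return now - eps, now + eps

def commit(transaction):
    # Take the commit timestamp from the upper bound of the interval
    _, commit_ts = truetime_now()
    apply_writes(transaction, commit_ts)   # hypothetical helper
    # Commit wait: do not acknowledge until commit_ts is definitely in the past
    while truetime_now()[0] < commit_ts:
        time.sleep(0.001)
    return commit_ts  # safe to acknowledge; later transactions get later timestamps
```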
```sql
-- Spanner: globally consistent reads at a timestamp
SELECT * FROM orders
WHERE customer_id = 42
  AND order_date >= '2026-01-01';
-- Automatically reads at a globally consistent timestamp
```
Spanner-style systems include Cloud Spanner (Google Cloud) and CockroachDB, an open-source Spanner-inspired database that substitutes hybrid logical clocks for TrueTime.
Conclusion
Distributed databases rely on consensus algorithms (Raft/Paxos) for consistent replication, gossip protocols for decentralized communication, CRDTs for conflict-free merging, and careful architecture design that balances consistency, availability, and latency. Dynamo-style systems optimize for availability with tunable consistency. Spanner-style systems optimize for strong consistency with global distribution. Choosing the right approach depends on your workload's consistency requirements and latency tolerance.