Distributed databases store data across multiple nodes while presenting a unified interface to applications. This article covers the fundamental concepts: consensus algorithms (Raft and Paxos), gossip protocols, conflict-free replicated data types (CRDTs), and the architectures behind two landmark distributed databases: Amazon Dynamo and Google Spanner.


Consensus Algorithms: Paxos and Raft


Consensus algorithms enable a group of nodes to agree on a value despite failures. They are the foundation of replicated state machines in distributed databases.


Paxos


Paxos, published by Leslie Lamport in 1998, was the first consensus algorithm to see wide practical use. It defines three roles:


  • **Proposer**: Proposes a value.
  • **Acceptor**: Votes on proposals.
  • **Learner**: Learns the agreed value.

Paxos operates in two phases:


  1. **Prepare phase**: The proposer sends a prepare request with a proposal number N to a majority of acceptors. Each acceptor promises not to accept proposals numbered less than N and responds with the highest-numbered proposal it has accepted.

  2. **Accept phase**: If the proposer receives responses from a majority, it sends an accept request with value V (the value from the highest-numbered previous proposal, or its own if none) to the acceptors. Acceptors accept the proposal unless they have promised to a higher number.


    
    # Simplified Paxos acceptor (a single proposer drives both phases)
    class PaxosNode:
        def __init__(self, node_id):
            self.node_id = node_id
            self.promised_n = 0         # highest proposal number promised
            self.accepted_n = 0         # highest proposal number accepted
            self.accepted_value = None  # value of that accepted proposal

        def handle_prepare(self, proposal_n):
            # Phase 1b: promise to ignore lower-numbered proposals and
            # report any value already accepted
            if proposal_n > self.promised_n:
                self.promised_n = proposal_n
                return ("PROMISE", self.accepted_n, self.accepted_value)
            return ("REJECT",)

        def handle_accept(self, proposal_n, value):
            # Phase 2b: accept unless a higher-numbered promise was made
            if proposal_n >= self.promised_n:
                self.promised_n = proposal_n
                self.accepted_n = proposal_n
                self.accepted_value = value
                return ("ACCEPTED",)
            return ("REJECT",)
    
    

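To see the two phases end to end, here is a hypothetical driver for the sketch above (the `propose` helper and three-acceptor setup are illustrative, not part of Paxos itself):


    # Hypothetical driver: one proposer, three acceptors
    acceptors = [PaxosNode(i) for i in range(3)]

    def propose(proposal_n, value):
        # Phase 1: gather promises from a majority
        responses = [a.handle_prepare(proposal_n) for a in acceptors]
        promises = [r for r in responses if r[0] == "PROMISE"]
        if len(promises) <= len(acceptors) // 2:
            return None  # no majority; retry with a higher proposal number
        # Must adopt the value of the highest-numbered accepted proposal
        _, _, prior = max(promises, key=lambda p: p[1])
        chosen = prior if prior is not None else value
        # Phase 2: ask the acceptors to accept the chosen value
        acks = [a.handle_accept(proposal_n, chosen) for a in acceptors]
        if sum(1 for r in acks if r[0] == "ACCEPTED") > len(acceptors) // 2:
            return chosen
        return None

    print(propose(1, "x"))  # -> x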
Paxos is provably correct but notoriously hard to understand and to implement faithfully. Most practical systems use variants or alternatives.


Raft


Raft is a consensus algorithm designed for understandability, published by Diego Ongaro and John Ousterhout in 2014. It divides consensus into three subproblems:


  1. **Leader election**: One node is elected leader. The leader handles all client requests.

  2. **Log replication**: The leader replicates log entries to followers.

  3. **Safety**: If any server has applied a log entry at a given index, no other server will ever apply a different entry at that index.


Raft elects leaders using randomized timeouts. Each node runs an election timer; when the timer expires on a follower, it becomes a candidate and starts an election.


    
    # Simplified Raft leader election
    import random

    class RaftNode:
        def __init__(self, node_id, peers):
            self.node_id = node_id
            self.peers = peers  # the other RaftNode instances
            self.state = "FOLLOWER"
            self.current_term = 0
            self.voted_for = None
            # Randomized timeouts make split votes unlikely
            self.election_timeout = random.uniform(150, 300)  # milliseconds

        def start_election(self):
            self.state = "CANDIDATE"
            self.current_term += 1
            self.voted_for = self.node_id
            votes = 1  # vote for self
            for peer in self.peers:
                response = peer.request_vote(self.current_term, self.node_id)
                if response["vote_granted"]:
                    votes += 1
            # Majority of the full cluster: peers plus self
            if votes > (len(self.peers) + 1) // 2:
                self.state = "LEADER"  # would now send heartbeats to followers

        def request_vote(self, term, candidate_id):
            # Grant at most one vote per term
            if term > self.current_term:
                self.current_term = term
                self.voted_for = None
                self.state = "FOLLOWER"
            if term == self.current_term and self.voted_for in (None, candidate_id):
                self.voted_for = candidate_id
                return {"vote_granted": True}
            return {"vote_granted": False}
    
    

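As a quick hypothetical walk-through, wiring three nodes together by hand (the setup is illustrative):


    # Hypothetical three-node election
    nodes = [RaftNode(i, peers=[]) for i in range(3)]
    for n in nodes:
        n.peers = [p for p in nodes if p is not n]

    nodes[0].start_election()   # node 0's election timer fired first
    print(nodes[0].state)       # -> LEADER (granted votes by nodes 1 and 2)
    print(nodes[1].voted_for)   # -> 0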
Raft implementations power etcd, Consul, and TiKV; MongoDB's replication protocol is Raft-based.


Paxos vs Raft


| Aspect | Paxos | Raft |
|--------|-------|------|
| Understandability | Hard | Accessible |
| Leadership | Implicit leader | Explicit leader election |
| Performance | Similar | Similar |
| Practical implementations | Google Chubby, Spanner | etcd, Consul, TiKV |


Gossip Protocol


Gossip (epidemic) protocols spread information through a cluster by having each node periodically exchange state with a small set of random peers. Information spreads exponentially, like an epidemic.


How Gossip Works


    
    Round 1: Node A infects nodes B and C
    Round 2: B infects D and E; C infects F and G
    Round 3: All 7 nodes have the information
    
    

    
    # Gossip protocol implementation
    import random

    class GossipNode:
        def __init__(self, node_id, peers):
            self.node_id = node_id
            self.peers = peers  # other GossipNode instances
            self.state = {}     # key -> value
            self.version = {}   # key -> version number

        def gossip_round(self):
            # Gossip with up to 3 random peers (the fanout)
            targets = random.sample(self.peers, min(3, len(self.peers)))
            for target in targets:
                self.exchange_state(target)

        def exchange_state(self, target):
            # Push our state to the peer and pull theirs back
            their_state = target.receive_state(self.snapshot())
            self.merge(their_state)

        def snapshot(self):
            return {k: (self.state[k], self.version[k]) for k in self.state}

        def receive_state(self, incoming):
            self.merge(incoming)
            return self.snapshot()

        def merge(self, incoming):
            # Keep the higher version for each key
            for key, (value, version) in incoming.items():
                if key not in self.version or version > self.version[key]:
                    self.state[key] = value
                    self.version[key] = version
    
    

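To watch the epidemic spread, here is a hypothetical experiment (the cluster size and seeded key are arbitrary) that seeds one node and counts the rounds until every node has the value:


    # Hypothetical convergence experiment
    nodes = [GossipNode(i, peers=[]) for i in range(64)]
    for n in nodes:
        n.peers = [p for p in nodes if p is not n]

    nodes[0].state["schema_version"] = 7
    nodes[0].version["schema_version"] = 1

    rounds = 0
    while any("schema_version" not in n.state for n in nodes):
        for n in nodes:
            n.gossip_round()
        rounds += 1
    print(rounds)  # small, growing roughly as O(log N)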
Properties of Gossip


  • **Scalability**: Information reaches all N nodes in O(log N) rounds.
  • **Fault tolerance**: Nodes can join, leave, or fail without disrupting propagation.
  • **Decentralization**: No central coordinator needed.
  • **Eventual consistency**: All nodes eventually converge on the same state.

Uses in Distributed Databases


  • **Membership detection**: Cassandra and Consul use gossip to track which nodes are alive.
  • **Metadata dissemination**: Distribute schema changes and configuration updates.
  • **Failure detection**: Nodes that stop responding to gossip are suspected of having failed.

Conflict-Free Replicated Data Types (CRDTs)


CRDTs are data structures that can be updated concurrently on different replicas and automatically merge without conflicts. They provide strong eventual consistency.


Types of CRDTs


**G-Counter (Grow-only Counter)**: A counter that only increases. Each replica owns one element in a vector. Increments only affect the replica's own element. Merge takes the element-wise maximum.


    
    class GCounter:
        def __init__(self, node_id, num_nodes):
            self.node_id = node_id
            self.counts = [0] * num_nodes  # one slot per replica

        def increment(self):
            # Each replica only ever increments its own slot
            self.counts[self.node_id] += 1

        def value(self):
            return sum(self.counts)

        def merge(self, other):
            # Element-wise max is commutative, associative, and idempotent
            for i in range(len(self.counts)):
                self.counts[i] = max(self.counts[i], other.counts[i])
    
    

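A hypothetical two-replica run, showing that both sides converge once they merge in each direction:


    a = GCounter(0, num_nodes=2)
    b = GCounter(1, num_nodes=2)
    a.increment(); a.increment()  # two increments on replica 0
    b.increment()                 # one concurrent increment on replica 1
    a.merge(b); b.merge(a)        # exchange state both ways
    print(a.value(), b.value())   # -> 3 3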
**PN-Counter (Positive-Negative Counter)**: Supports both increments and decrements using two G-Counters.

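A minimal sketch built on the GCounter above (the class name is illustrative):


    class PNCounter:
        def __init__(self, node_id, num_nodes):
            self.p = GCounter(node_id, num_nodes)  # tracks increments
            self.n = GCounter(node_id, num_nodes)  # tracks decrements

        def increment(self):
            self.p.increment()

        def decrement(self):
            self.n.increment()

        def value(self):
            return self.p.value() - self.n.value()

        def merge(self, other):
            self.p.merge(other.p)
            self.n.merge(other.n)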

**G-Set (Grow-only Set)**: Elements can only be added, never removed. Merge is set union.

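In code, the whole structure is a thin wrapper around a set (a sketch, not any particular library's API):


    class GSet:
        def __init__(self):
            self.items = set()

        def add(self, item):
            self.items.add(item)

        def merge(self, other):
            # Set union is commutative, associative, and idempotent
            self.items |= other.items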

**LWW-Register (Last-Writer-Wins Register)**: Each write is tagged with a timestamp. On conflict, the write with the highest timestamp wins.


    
    class LWWRegister:
        def __init__(self):
            self.value = None
            self.timestamp = 0

        def set(self, value, timestamp):
            # In practice, timestamp ties are broken deterministically
            # (for example, by node id)
            if timestamp > self.timestamp:
                self.value = value
                self.timestamp = timestamp

        def merge(self, other):
            if other.timestamp > self.timestamp:
                self.value = other.value
                self.timestamp = other.timestamp
    
    

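A hypothetical conflict between two replicas, resolved by timestamp on merge:


    r1, r2 = LWWRegister(), LWWRegister()
    r1.set("blue", timestamp=10)   # write on replica 1
    r2.set("green", timestamp=12)  # later concurrent write on replica 2
    r1.merge(r2); r2.merge(r1)
    print(r1.value, r2.value)      # -> green green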
CRDTs in Production


  • **Redis CRDTs**: Redis Enterprise uses CRDTs for active-active geo-distribution.
  • **Riak**: Distributed database that ships CRDT data types for conflict-free replication.
  • **Automerge**: JavaScript CRDT library for collaborative applications.

Amazon Dynamo-Style Architectures


Dynamo-style databases (DynamoDB, Cassandra, Riak) prioritize availability and partition tolerance over strong consistency.


Key Design Decisions


  1. **Consistent hashing**: Data is distributed across nodes using consistent hashing, which minimizes data movement when nodes join or leave (see the ring sketch after this list).

  2. **N = 3 replication**: Each data item is replicated to N nodes (typically three). The coordinator forwards writes to all replicas.

  3. **Tunable consistency**: Each read or write operation specifies its consistency level.

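Here is a rough sketch of the ring idea (the hash function and virtual-node count are arbitrary choices for illustration, not any particular database's implementation):


    import bisect
    import hashlib

    class ConsistentHashRing:
        def __init__(self, nodes, vnodes=8):
            # Place several virtual points per node on the ring to
            # smooth out the key distribution
            self.ring = sorted(
                (self._hash(f"{node}#{i}"), node)
                for node in nodes for i in range(vnodes)
            )
            self.hashes = [h for h, _ in self.ring]

        @staticmethod
        def _hash(s):
            return int(hashlib.md5(s.encode()).hexdigest(), 16)

        def get_node(self, key):
            # First virtual point clockwise from the key's hash
            i = bisect.bisect(self.hashes, self._hash(key)) % len(self.ring)
            return self.ring[i][1]

    ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
    print(ring.get_node("user:42"))  # e.g. node-b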

    
    # Dynamo-style write with tunable consistency
    # (consistent_hash_ring, get_preference_list, and replica.write are
    # assumed helpers provided by the surrounding system)
    def dynamo_write(key, value, version, consistency_level="QUORUM"):
        # 1. Determine the coordinator using consistent hashing
        coordinator = consistent_hash_ring.get_node(key)

        # 2. Forward the write to all N replicas in the preference list
        #    (a real coordinator sends these in parallel)
        replicas = coordinator.get_preference_list(key, N=3)
        responses = [replica.write(key, value, version) for replica in replicas]

        # 3. Wait for the requested consistency level
        write_consistency = {
            "ONE": 1,
            "QUORUM": len(replicas) // 2 + 1,
            "ALL": len(replicas),
        }
        required = write_consistency[consistency_level]
        acks = sum(1 for r in responses if r["success"])
        return "SUCCESS" if acks >= required else "ERROR"
    
    

Cassandra Example


    
    -- Cassandra: consistency is chosen per operation. The USING CONSISTENCY
    -- clause was removed from CQL; in cqlsh the CONSISTENCY command sets the
    -- level for subsequent statements (drivers expose the same setting per query).

    -- Write requiring acknowledgement from a majority of replicas
    CONSISTENCY QUORUM;
    INSERT INTO users (user_id, name, email)
    VALUES ('u1', 'Alice', 'alice@example.com');

    -- Read with eventual consistency (fastest, may return stale data)
    CONSISTENCY ONE;
    SELECT * FROM users WHERE user_id = 'u1';

    -- Read with the strongest consistency (every replica must respond)
    CONSISTENCY ALL;
    SELECT * FROM users WHERE user_id = 'u1';

When read and write quorums overlap (R + W > N), a quorum read is guaranteed to see the most recent acknowledged write.
    
    

Google Spanner-Style Architectures


Spanner is Google's globally distributed database; it provides external consistency (linearizability) across data centers.


Key Innovations


  1. **TrueTime**: GPS receivers and atomic clocks synchronize time across data centers with bounded clock uncertainty. TrueTime exposes a time interval `[earliest, latest]` guaranteed to contain the current time.

  2. **2PC + Paxos**: Spanner uses two-phase commit for transactions spanning multiple Paxos groups, with TrueTime timestamps providing external consistency.

  3. **Commit wait**: Spanner waits until the TrueTime uncertainty interval around a write's timestamp has passed before acknowledging the commit, ensuring that timestamp order matches real-time order (a toy sketch follows this list).

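To make commit wait concrete, here is a toy illustration (this is not Spanner's API; the 7 ms uncertainty bound is an invented example value):


    import time

    EPSILON = 0.007  # assumed clock uncertainty bound, in seconds

    def tt_now():
        # TrueTime-style interval: true time lies within [earliest, latest]
        t = time.time()
        return (t - EPSILON, t + EPSILON)

    def commit():
        commit_ts = tt_now()[1]  # timestamp at the top of the interval
        # Commit wait: block until commit_ts is definitely in the past
        while tt_now()[0] < commit_ts:
            time.sleep(0.001)
        return commit_ts  # safe to acknowledge; real-time order holds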

    
    -- Spanner: globally consistent reads at a timestamp
    SELECT * FROM orders
    WHERE customer_id = 42
      AND order_date >= '2026-01-01';
    -- Automatically reads at a globally consistent timestamp
    
    

Systems in this mold include Cloud Spanner (Google Cloud) and CockroachDB, an open-source Spanner-inspired database that uses hybrid logical clocks in place of TrueTime.


Conclusion


Distributed databases rely on consensus algorithms (Raft/Paxos) for consistent replication, gossip protocols for decentralized communication, CRDTs for conflict-free merging, and careful architectural trade-offs among consistency, availability, and latency. Dynamo-style systems optimize for availability with tunable consistency; Spanner-style systems optimize for strong consistency with global distribution. Choosing the right approach depends on your workload's consistency requirements and latency tolerance.