Leader Election in Distributed Systems


Leader election is the mechanism by which distributed systems select a single node to coordinate work or make decisions on behalf of the group. It is essential for systems where exactly one node must perform certain operations — assigning monotonically increasing sequence numbers, performing periodic maintenance, or managing group membership changes. The leader must be unique at any time, and the system must remain available despite leader failures.

Leader election algorithms are broadly categorized into two types: consensus-based and lease-based. Consensus-based elections use Paxos, Raft, or Zab to guarantee that exactly one leader exists at any epoch. These provide strong safety guarantees but require a quorum of nodes for election decisions. Lease-based elections use a distributed lock with a TTL — the holder is the leader until its lease expires. These are simpler but vulnerable to split-brain when clock skew or process pauses lead a node to act on a lease that has already expired from the rest of the system's point of view.
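The lease-based scheme can be sketched in a few lines. This is a minimal in-memory illustration — the `LeaseStore` class is hypothetical, standing in for a real distributed lock service — showing the core rule: a candidate acquires the lease if it is free or expired, and the current holder renews by re-acquiring.

```python
import time

class LeaseStore:
    """Toy stand-in for a distributed lock service with TTL leases.
    (Hypothetical class for illustration; a real store would be etcd,
    a database row, or similar.)"""
    def __init__(self):
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, candidate, ttl, now=None):
        now = time.monotonic() if now is None else now
        # The lease is free if unheld or its TTL has elapsed; the
        # current holder may also renew, which restarts the TTL.
        if self.holder == candidate or self.holder is None or now >= self.expires_at:
            self.holder = candidate
            self.expires_at = now + ttl
            return True
        return False

# usage: node-a leads, renews once, then loses the lease by not renewing
store = LeaseStore()
store.try_acquire("node-a", ttl=5, now=0)   # node-a becomes leader
store.try_acquire("node-b", ttl=5, now=3)   # refused: lease still held
store.try_acquire("node-a", ttl=5, now=4)   # renewal: deadline moves to 9
store.try_acquire("node-b", ttl=5, now=10)  # lease expired: node-b leads
```

The split-brain hazard lives in the `now` parameter: if node-a's clock (or a long GC pause) makes it believe the lease is still valid after `expires_at` has passed elsewhere, two nodes can act as leader at once.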

ZooKeeper implements leader election through ephemeral sequential znodes. Multiple candidates create an ephemeral sequential znode under an election path. The candidate with the smallest sequence number becomes the leader. Other candidates watch the preceding znode. When the leader's session expires (detected through ZooKeeper heartbeats), the ephemeral znode is deleted, and the next candidate in sequence becomes leader. ZooKeeper itself exposes only the primitives — ephemeral and sequential znodes plus watches — while libraries like Apache Curator build production-ready election recipes on top of them.
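The ordering logic of this recipe can be shown without a live ZooKeeper cluster. The sketch below (the `elect` helper is hypothetical) takes the children of the election path as (name, sequence) pairs and computes who leads and who watches whom; each candidate watches only its immediate predecessor, which avoids a thundering herd of notifications when the leader fails.

```python
def elect(znodes):
    """Given (name, sequence) pairs for the children of an election
    path, return the leader (smallest sequence) and a watch map where
    each candidate watches the znode immediately preceding it.
    (Hypothetical helper mirroring the ZooKeeper election recipe.)"""
    ordered = sorted(znodes, key=lambda z: z[1])
    leader = ordered[0][0]
    watches = {ordered[i][0]: ordered[i - 1][0]
               for i in range(1, len(ordered))}
    return leader, watches

# usage: cand-b created its znode first, so it leads;
# cand-c watches cand-b, and cand-a watches cand-c
leader, watches = elect([("cand-a", 3), ("cand-b", 1), ("cand-c", 2)])
```

When the watched znode disappears, a candidate re-reads the children and re-runs this computation — if it now holds the smallest sequence, it has become the leader.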

Etcd's leadership election uses the concurrency package's mutex, which is built on etcd's linearizable read and write guarantees. A leader candidate attempts to acquire a lock associated with a lease. The lease TTL defines the leader's maximum tenure without renewal. The leader must periodically refresh the lease. If the lease expires, the lock is released, and a new leader can acquire it. Etcd's Raft consensus ensures that the lock state is consistent across all nodes.
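The refresh-or-lose dynamic of the lease can be made concrete with a small timeline calculation. The `leadership_lost_at` helper below is hypothetical — a model of the TTL mechanics described above, not etcd client code: each timely refresh restarts the TTL, and the first refresh that arrives after the previous deadline is too late.

```python
def leadership_lost_at(grant_time, ttl, refresh_times):
    """Return the instant the lease expires, given when the holder
    refreshed it. Each in-time refresh restarts the TTL; a refresh
    after the deadline is too late — the lease already lapsed.
    (Hypothetical model of lease TTL mechanics, not an etcd API.)"""
    deadline = grant_time + ttl
    for t in sorted(refresh_times):
        if t > deadline:
            return deadline      # refresh arrived after expiry
        deadline = t + ttl       # timely refresh restarts the TTL
    return deadline

# usage: with a 10s TTL, refreshes at t=4 and t=9 keep the lease alive
# until 19; the refresh at t=25 comes too late to matter
expiry = leadership_lost_at(0, 10, [4, 9, 25])
```

In practice the holder refreshes well inside the TTL (commonly at a third to a half of it) so that one missed heartbeat does not cost the leadership.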

Kubernetes provides its own leader election mechanism, historically based on annotations on ConfigMaps or Endpoints and, in newer versions, on dedicated Lease objects in the coordination.k8s.io API group. The candidate creates or updates the Lease with its identity. The Lease has a holderIdentity, leaseDurationSeconds, acquireTime, and renewTime. Each candidate reads the Lease. If the Lease is expired (current time exceeds renewTime + leaseDurationSeconds), the candidate attempts to become the leader by updating the Lease. This is lightweight — requiring only API server access rather than a separate consensus cluster.
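The expiry check each candidate performs is a one-line rule. The sketch below expresses it directly; the field names follow the Lease spec described above, but the `lease_expired` helper itself is hypothetical, not part of any Kubernetes client library.

```python
from datetime import datetime, timedelta, timezone

def lease_expired(renew_time, lease_duration_seconds, now=None):
    """The Lease is up for grabs once the current time exceeds
    renewTime + leaseDurationSeconds. (Hypothetical helper; the field
    names mirror the Lease object, the function is for illustration.)"""
    now = now or datetime.now(timezone.utc)
    return now > renew_time + timedelta(seconds=lease_duration_seconds)

# usage: with a 15s lease duration last renewed at t0, the lease is
# still held 10s later but has expired 16s later
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
held = lease_expired(t0, 15, now=t0 + timedelta(seconds=10))
free = lease_expired(t0, 15, now=t0 + timedelta(seconds=16))
```

Because every candidate updates the Lease through the API server, optimistic concurrency (resourceVersion conflicts) ensures that only one of several simultaneous takeover attempts succeeds.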

Fencing is a critical final step that is often overlooked. When a new leader is elected, it must ensure that the old leader is truly stopped or fenced before proceeding. Without fencing, both leaders could operate simultaneously (split-brain), causing data corruption. Fencing strategies include: revoking the old leader's access credentials, terminating its process, or issuing monotonically increasing fencing tokens that downstream resources reject when presented by a stale leader.
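The fencing-token strategy is worth spelling out, because the enforcement happens at the resource, not at the leader. A minimal sketch, assuming a hypothetical `FencedStore` resource: every election hands out a higher token, and the store refuses any write carrying a token older than the highest it has already seen.

```python
class FencedStore:
    """Resource that enforces fencing: writes carrying a token lower
    than the highest token seen so far are refused, so a paused or
    partitioned old leader cannot corrupt data after a new election.
    (Hypothetical class for illustration.)"""
    def __init__(self):
        self.highest_token = -1
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            return False  # stale leader: request fenced off
        self.highest_token = max(self.highest_token, token)
        self.data[key] = value
        return True

# usage: leader with token 1 writes, a new leader with token 2 takes
# over, and the old leader's late write with token 1 is rejected
store = FencedStore()
store.write(1, "config", "v1")   # accepted
store.write(2, "config", "v2")   # new leader, accepted
store.write(1, "config", "old")  # stale token, refused
```

The token can come directly from the election mechanism — ZooKeeper's znode sequence number or etcd's lease revision are natural monotonic sources.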

Failure handling must account for various scenarios. Network partitions can isolate the leader from some followers while others can still reach it. The system should use a majority quorum to ensure that even if the leader is isolated from some nodes, only one leader can operate. ZooKeeper and etcd both enforce majority rules — a leader needs a majority of votes to be elected and to commit operations.
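The majority rule itself is simple arithmetic, and stating it as code makes the partition argument concrete: two disjoint partitions of the same cluster cannot both reach a strict majority. The `has_quorum` helper below is a hypothetical illustration of this rule.

```python
def has_quorum(votes, cluster_size):
    """Majority quorum: strictly more than half the cluster. Because
    two disjoint partitions cannot both exceed half, at most one side
    of a partition can elect a leader. (Hypothetical helper.)"""
    return votes >= cluster_size // 2 + 1

# usage: in a 5-node cluster split 3/2, only the 3-node side
# can elect a leader; the 2-node side cannot
majority_side = has_quorum(3, 5)
minority_side = has_quorum(2, 5)
```

This is also why clusters are typically sized with an odd number of nodes: a 4-node cluster needs 3 votes for quorum, so it tolerates no more failures than a 3-node cluster while adding a node that can fail.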

Leader transitions should be graceful. During a transition, the system is in a window of unavailability. Client requests to the old leader will fail, and requests to the new leader may be rejected or queued until its state is initialized. Readiness probes should reflect the leader's initialization status. Clients should implement retry with backoff to transparently survive leadership changes.
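The client-side retry loop mentioned above can be sketched briefly. This is a generic pattern, not tied to any particular system: failures during the transition window are retried with exponential backoff and jitter, so clients neither give up during a short election nor hammer the cluster in lockstep.

```python
import random
import time

def retry_with_backoff(call, attempts=5, base=0.1, cap=2.0):
    """Retry a request across a leadership transition. ConnectionError
    stands in for 'no reachable leader yet'; the delay doubles per
    attempt up to a cap, with jitter to avoid synchronized retries.
    (Hypothetical helper illustrating the generic pattern.)"""
    delay = base
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # transition outlasted our retry budget
            time.sleep(random.uniform(0, delay))  # jittered wait
            delay = min(delay * 2, cap)

# usage: a request that fails twice during the transition, then
# succeeds once the new leader is initialized
calls = {"n": 0}
def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("leader unavailable")
    return "ok"
```

The retry budget should be chosen to comfortably exceed the expected election time, which ties this pattern back to the recovery-time metric discussed below.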

The scope of leader responsibilities must be clearly defined. A single leader per cluster avoids conflicts but limits throughput. Partitioned leadership assigns different leaders for different shards or partitions. For example, Kafka uses one leader per partition, allowing parallel leadership across partitions within the same cluster. This pattern scales beyond what single-leader systems can achieve.
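Partitioned leadership can be illustrated with a toy assignment. The sketch below is hypothetical and deliberately simplified — real Kafka assignment also balances replicas and racks — but it shows the essential idea: each partition gets its own leader, so leadership work spreads across the cluster.

```python
def assign_partition_leaders(partitions, brokers):
    """Round-robin one leader per partition across the brokers, in the
    spirit of per-partition leadership. (Hypothetical sketch; a real
    assignment also considers replica placement and rack awareness.)"""
    return {p: brokers[i % len(brokers)]
            for i, p in enumerate(partitions)}

# usage: four partitions spread over three brokers — every broker
# leads at least one partition, so writes parallelize
leaders = assign_partition_leaders(
    ["p0", "p1", "p2", "p3"], ["b1", "b2", "b3"])
```

Each partition then runs its own election and failover independently, so a single broker failure triggers re-election only for the partitions it led.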

Observability of leadership state is essential. Expose leadership status as a metric. Log leadership transitions with reasons (startup, lease expiry, voluntary stepdown, failure). Alert on frequent transitions or extended periods without a leader. Monitor the time between leader failure and new leader election — this recovery time is the window of unavailability for leader-dependent operations.
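The recovery-time metric can be derived from the transition log itself. A small sketch, assuming a hypothetical event stream of timestamped "lost" and "elected" records: each window between a loss and the next election is a period during which leader-dependent operations were unavailable.

```python
def recovery_windows(events):
    """Given (timestamp, event) pairs where event is 'lost' or
    'elected', return the unavailability windows between each leader
    loss and the subsequent election — the recovery-time metric to
    monitor and alert on. (Hypothetical helper for illustration.)"""
    windows, lost_at = [], None
    for t, event in events:
        if event == "lost":
            lost_at = t
        elif event == "elected" and lost_at is not None:
            windows.append(t - lost_at)
            lost_at = None
    return windows

# usage: two failovers — the first took 3s to recover, the second 5s
windows = recovery_windows([
    (0, "elected"), (10, "lost"), (13, "elected"),
    (40, "lost"), (45, "elected"),
])
```

Alerting on the maximum or 99th-percentile window, alongside a simple transition-rate counter, covers both failure modes the text calls out: slow recovery and flapping leadership.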