Raft Consensus

Raft is a consensus algorithm specifically designed for managing replicated state machines in distributed systems. It was developed as an alternative to the Paxos algorithm, with a focus on understandability and ease of implementation. Raft achieves consensus among servers or nodes in a network, ensuring that they agree on the sequence of commands that update the state machines.

The key aspects of the Raft consensus algorithm are:

  • Leader election: Raft elects a single leader node responsible for managing the log replication process. If a leader fails or becomes unreachable, the remaining nodes will automatically elect a new leader. You can see this in action with Aeron Cluster in the quickstart demo animation.
  • Log replication: The leader node receives client commands, appends them to its log, and then replicates the log entries across the follower nodes. Once a majority of nodes acknowledge the receipt of the log entries, the leader considers them committed and informs the followers.
  • Safety: Raft ensures that committed log entries are never lost and are eventually applied to all the state machines in the same order. It also guarantees that only one candidate can win the election for a given term, preventing any conflicts or inconsistencies.

Aeron Cluster offers — via premium modules — advanced cluster deployment configurations that extend Aeron Cluster's highly available capabilities further.

Raft's Log Replication Architecture

During normal operation, the flow from a client to the cluster is as follows:

  1. The client process sends the command to the consensus module on the leader node.
  2. The leader's consensus module appends the command to the tail of its local log and, in parallel, replicates the command to the follower nodes.
  3. Once a majority of nodes acknowledge the append, the command is committed and then handed off to the Replicated State Machine for processing. Any earlier messages still in the log are processed first, in append order (the commit rule is sketched after this list).
  4. The Replicated State Machine produces any resulting events, which are sent to the targeted client(s).
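
A minimal sketch of the leader-side commit rule in step 3 follows. It is illustrative Java pseudocode, not Aeron Cluster's implementation; the class and method names (LeaderLog, onAppendAcknowledged) are assumptions:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: simplified leader-side append and commit logic.
    final class LeaderLog
    {
        private final List<String> log = new ArrayList<>();  // replicated log entries
        private int commitIndex = -1;                         // highest committed entry
        private final int clusterSize;

        LeaderLog(final int clusterSize)
        {
            this.clusterSize = clusterSize;
        }

        // Step 2: append locally; replication to followers happens in parallel (not shown).
        int append(final String command)
        {
            log.add(command);
            return log.size() - 1;
        }

        // Step 3: an entry is committed once a majority of nodes (leader included) hold it.
        void onAppendAcknowledged(final int index, final int nodesHoldingEntry)
        {
            final int majority = clusterSize / 2 + 1;
            if (nodesHoldingEntry >= majority && index > commitIndex)
            {
                commitIndex = index;  // entries up to commitIndex are applied to the state machine in order
            }
        }
    }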

Leader Election

At any time, a node is in one of three states: Leader, Follower or Candidate. During normal operation, one node is a leader and the remaining nodes are followers.

The cluster member state machine runs as per the diagram [1] below:

If a follower does not receive messages from the leader process before the configured timeout, it starts an election: the node becomes a candidate and sends vote-request messages to all other nodes, asking them to confirm it as leader. If the candidate receives votes from a majority (quorum) of nodes, it becomes the leader.
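
These transitions can be summarised in a short sketch. It is illustrative only and does not reflect Aeron Cluster's internal classes; names such as RaftMember and onElectionTimeout are assumptions:

    // Illustrative only: a simplified Raft member state machine.
    enum NodeState { FOLLOWER, CANDIDATE, LEADER }

    final class RaftMember
    {
        private NodeState state = NodeState.FOLLOWER;
        private long currentTerm = 0;

        // No message from the leader arrived within the election timeout.
        void onElectionTimeout()
        {
            state = NodeState.CANDIDATE;
            currentTerm++;  // start a new term and request votes from all other nodes (messaging not shown)
        }

        // Votes counted for the current term, including the candidate's own vote.
        void onVotesCounted(final int votesReceived, final int clusterSize)
        {
            if (state == NodeState.CANDIDATE && votesReceived > clusterSize / 2)
            {
                state = NodeState.LEADER;  // quorum majority reached
            }
        }

        // A message from a leader with an equal or higher term was received.
        void onLeaderMessage(final long leaderTerm)
        {
            if (leaderTerm >= currentTerm)
            {
                currentTerm = leaderTerm;
                state = NodeState.FOLLOWER;  // step down or remain a follower
            }
        }
    }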

You can test this interaction in the Aeron Cluster quickstart running within Docker by finding the leader and killing it:

  1. First, identify the leader. There are two ways to do this: the leader logs "LEADER" once an election has chosen a leader, or you can use the docker_find_leader.sh script.
  2. Kill the leader with docker compose kill nodeX, where nodeX is the leader node.

You can also see this in action in the demo animation.

Timeouts in Aeron Cluster

In the original Raft paper, two essential timeouts manage leader election and ensure that the system's progress is maintained. They are:

  • Election timeout: This timeout triggers a leader election when a follower node has not received any communication from the current leader or other candidates for a specific period. When it expires, the node transitions to candidate status and initiates an election. The paper recommends randomising this timeout to a value between 150 and 300 milliseconds (see the sketch after this list).
  • Heartbeat timeout: This timeout is used by the leader node to maintain authority over the cluster and ensure that the follower nodes' logs remain up-to-date.
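
A one-line sketch of the randomised election timeout recommended by the paper; the class and method names are illustrative:

    import java.util.concurrent.ThreadLocalRandom;

    final class ElectionTimeouts
    {
        // Randomise between 150 and 300 ms so nodes rarely time out simultaneously and split the vote.
        static long nextElectionTimeoutMs()
        {
            return ThreadLocalRandom.current().nextLong(150, 301);
        }
    }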

Aeron configuration allows more fine-grained control over these timeouts:

  • electionTimeoutNs: This is the election timeout, excluding the timeout used to detect a failed leader. In Aeron Cluster, this value is used in a simple computation that effectively sets the election timeout to a value between 0 and 500ms.
  • leaderHeartbeatTimeoutNs: This timeout is used by the follower nodes to detect a failed leader. In Aeron Cluster, the value defaults to 10 seconds and does not include any randomized elements.
  • leaderHeartbeatIntervalNs: This timeout is equivalent to the heartbeat timeout in the original Raft paper and defaults to 200 milliseconds in Aeron Cluster.

You will notice that the election-related timeouts (electionTimeoutNs and leaderHeartbeatTimeoutNs) are significantly higher in Aeron Cluster. This design choice is crucial for maintaining stable leadership terms, and therefore uninterrupted processing of the log, in environments with higher network latencies (such as the cloud) and with Java GC pauses. The leaderHeartbeatTimeoutNs value must be configured, driven by your system requirements, performance and infrastructure, to strike a balance between keeping leaders stable and being able to quickly detect and recover from a failed node.

In the Aeron Cluster Quickstart, which runs in a local Docker or Kubernetes environment, the leaderHeartbeatTimeoutNs is reduced to 3 seconds in ClusterApp.java.
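
A minimal configuration sketch is shown below. It assumes the fluent setters on io.aeron.cluster.ConsensusModule.Context (electionTimeoutNs, leaderHeartbeatTimeoutNs, leaderHeartbeatIntervalNs) found in recent Aeron releases; verify the names against your Aeron version, and note that a real deployment needs further context configuration (cluster members, directories, archive) that is omitted here:

    import java.util.concurrent.TimeUnit;

    import io.aeron.cluster.ConsensusModule;

    final class TimeoutConfig
    {
        // Sketch: apply election and heartbeat timeouts to a consensus module context.
        static ConsensusModule.Context applyTimeouts(final ConsensusModule.Context ctx)
        {
            return ctx
                .electionTimeoutNs(TimeUnit.SECONDS.toNanos(1))                 // election phase timeout
                .leaderHeartbeatTimeoutNs(TimeUnit.SECONDS.toNanos(3))          // failed-leader detection (quickstart value)
                .leaderHeartbeatIntervalNs(TimeUnit.MILLISECONDS.toNanos(200)); // leader heartbeat interval
        }
    }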

Client Data Loss in Leader Loss Scenarios

It is important to note that Raft clients can lose data during leader failure. For example, in a typical Aeron Cluster implementation, FIX engines such as Artio are often coupled with Aeron Cluster clients to construct a trading system. During a leader failure, data sent by a client of the Raft cluster, such as an end-user FIX session within the Artio FIX engine, can be lost if the leader fails at the moment the message is sent. With Aeron Cluster, the client implementation must decide how to handle this; for example, the client may disconnect all connected FIX sessions and offer a protocol to verify trade and/or order status.

Within the Aeron Cluster Quickstart, you can see code that accounts for this scenario in the PendingMessageManager. The mechanism is simple: a correlation ID field in the cluster protocol allows clients to track messages sent to the cluster and mark them as received once a response returns. If a response does not return within a timeout period, an error is logged.
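
The idea can be sketched with a small tracker keyed by correlation ID. This is not the quickstart's PendingMessageManager; the class and method names (PendingTracker, onMessageSent, onResponse, checkTimeouts) are assumptions for illustration:

    import java.util.Iterator;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative only: track in-flight messages by correlation ID and log any
    // that do not receive a response within the timeout.
    final class PendingTracker
    {
        private final Map<Long, Long> sentTimeByCorrelationId = new ConcurrentHashMap<>();
        private final long timeoutMs;

        PendingTracker(final long timeoutMs)
        {
            this.timeoutMs = timeoutMs;
        }

        void onMessageSent(final long correlationId)
        {
            sentTimeByCorrelationId.put(correlationId, System.currentTimeMillis());
        }

        void onResponse(final long correlationId)
        {
            sentTimeByCorrelationId.remove(correlationId);  // acknowledged, no longer pending
        }

        // Call periodically, for example from the client's duty cycle.
        void checkTimeouts()
        {
            final long now = System.currentTimeMillis();
            final Iterator<Map.Entry<Long, Long>> it = sentTimeByCorrelationId.entrySet().iterator();
            while (it.hasNext())
            {
                final Map.Entry<Long, Long> entry = it.next();
                if (now - entry.getValue() > timeoutMs)
                {
                    System.err.println("No response within timeout for correlationId=" + entry.getKey());
                    it.remove();
                }
            }
        }
    }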

Further reading

  • D. Ongaro and J. Ousterhout, “In Search of an Understandable Consensus Algorithm,” USENIX Annual Technical Conference, 2014. GitHub

  1. Diagram is from Heidi Howard and Richard Mortier. 2020. Paxos vs Raft: have we reached consensus on distributed consensus? In Proceedings of the 7th Workshop on Principles and Practice of Consistency for Distributed Data (PaPoC '20). Association for Computing Machinery, New York, NY, USA, Article 8, 1–9. DOI:https://doi.org/10.1145/3380787.3393681