Consensus Algorithms

Replicated State Machines solve the question as to how to ensure that processes that receive the same inputs in the same order will produce the same outputs. They do not however answer the question as to how to get a collection of different machines to act as one, with an agreed order of inputs despite node failures. Consensus algorithms step in to solve this problem.

There are several different approaches to achieving ordering of messages in a Replicated State Machine, including:

Atomic Broadcast
Viewstamped Replication
Paxos, and
Raft

This section will focus on Raft, as used in Aeron Cluster.

Limitations¶

Consensus algorthims like Raft are best used for state that must remain consistent, is under high contention and which requires resilience. They are however limited as to how far they can be scaled since each cluster node runs the same code. Adding more nodes just adds resilience - it does not increase performance. As a result, they should be used very sparingly - if at all. When needed, the cluster node processes should be very carefully managed - do not send data in or through the cluster unless absolutely required.

Raft¶

Raft is an evolution of the Paxos consensus algorithm, and has growing usage in distributed systems used for infrastructure (Hashicorp Consul, etcd and others) as well as business logic services such as a matching engine in a financial trading system.

Key aspects of Raft:

there is a Strong Leader, which means that all log entries flow from the leader to followers
Raft makes use of randomized timers to elect leaders. This adds a few milliseconds to failover, but reduces the time to agree an elected leader (in Aeron Cluster, this is a maximum of the election timeout * 2).
the Raft protocol allows runtime configuration changes (i.e. adding new or removing nodes at runtime).

Note: Raft can result in client data loss in the case of a leader failure. This is discussed further below.

Log Replication Architecture¶

Raft is an architecture for replicating logs. Much like Replicated State Machines, Raft clients send in commands and receive back events.

During normal operation, the flow is as follows:

The client process sends the command into the leader server consensus module
The leader consensus module appends the command to the local log. In parallel, the consensus module replicates the commands to the follower nodes
Once at least quorum nodes responds positively to the append log command, the command is committed and the committed command is then handed off to the Replicated State Machine for processing.
The Replicated State Machine produces any events and they are sent to the client.

Leader Election¶

At any time, a node is in one of three states: Leader, Follower or Candidate. During normal operation, one node is a leader and the remaining nodes are followers.

The leader state machine runs as per the diagram¹ below:

If a follower does not receive messages from the leader process before the configured timeout, the node will start an election. It then becomes a candidate node, and sends messages to all other nodes requesting a vote to confirm if the candidate node can become a leader. If the node receives a quorum majority vote, it becomes the leader.

Client Data loss in Leader Loss scenarios¶

Raft clients can lose data during leader failure. To give an example, in a typical Aeron Cluster implementation, FIX engines such as Artio are often coupled with Aeron Cluster clients to construct a trading system. During a leader failure it is possible for data sent by a client of the Raft cluster - such as an end user FIX session within an Artio FIX engine - to be lost. The client implementation must make the necessary decisions about what to do - for example, the client may disconnect all connected FIX sessions and offer a protocol to verify trade and/or order status.