Replicated State Machines answer the question of how to ensure that processes receiving the same inputs in the same order produce the same outputs. They do not, however, answer the question of how to get a collection of different machines to act as one, with an agreed order of inputs despite node failures. Consensus algorithms solve this problem.
There are several different approaches to achieving ordering of messages in a Replicated State Machine, including:
- Atomic Broadcast
- Viewstamped Replication
- Paxos, and
- Raft
This section will focus on Raft, as used in Aeron Cluster.
Consensus algorithms like Raft are best used for state that must remain consistent, is under high contention, and requires resilience. They are, however, limited in how far they can scale: every cluster node runs the same code, so adding nodes adds resilience but not performance. As a result, they should be used sparingly, if at all. When needed, the cluster node processes should be very carefully managed: do not send data in or through the cluster unless absolutely required.
Raft is an evolution of the Paxos consensus algorithm, and has growing usage in distributed systems used for infrastructure (Hashicorp Consul, etcd and others) as well as business logic services such as a matching engine in a financial trading system.
Key aspects of Raft:
- there is a Strong Leader, which means that all log entries flow from the leader to followers
- Raft makes use of randomized timers to elect leaders. This adds a few milliseconds to failover, but reduces the time taken to agree on an elected leader (in Aeron Cluster, this is bounded at a maximum of twice the election timeout).
- the Raft protocol allows runtime configuration changes (i.e. adding new or removing nodes at runtime).
Note: Raft can result in data loss in the case of a leader failure. This is discussed further below.
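The randomized election timer described above can be sketched as follows. This is a minimal illustration of the technique from the Raft paper; the class name and base timeout value are illustrative assumptions, not Aeron Cluster's actual configuration or API.

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of a randomized election timer, as described in the Raft paper.
// The base timeout here is an illustrative value, not Aeron Cluster's default.
final class ElectionTimer
{
    static final long BASE_TIMEOUT_MS = 150;

    // Each node picks a timeout in [BASE, 2 * BASE). The node whose timer
    // fires first starts an election, so staggered timers make split votes
    // (two candidates starting simultaneously) unlikely.
    static long nextTimeoutMs()
    {
        return BASE_TIMEOUT_MS + ThreadLocalRandom.current().nextLong(BASE_TIMEOUT_MS);
    }
}
```

Because each node draws a fresh random timeout, a failed election re-randomizes and the nodes quickly diverge, which is why failover completes within a small bounded window.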
Log Replication Architecture
Raft is an architecture for replicating logs. Much like Replicated State Machines, Raft clients send in commands and receive back events.
During normal operation, the flow is as follows:
- The client process sends the command into the leader server consensus module
- The leader consensus module appends the command to its local log and, in parallel, replicates the command to the follower nodes
- Once at least a quorum of nodes responds positively to the append-log request, the command is committed, and the committed command is handed off to the Replicated State Machine for processing.
- The Replicated State Machine produces any events and they are sent to the client.
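The quorum step above can be sketched as the standard Raft commit-index calculation. This is a hypothetical illustration under assumed names, not Aeron Cluster's implementation: the leader tracks, per node, the highest log index known to be appended there, and an entry is committed once a majority holds it.

```java
import java.util.Arrays;

// Sketch of how a leader decides an entry is committed: once a quorum of
// nodes (leader included) has appended the entry, it is safe to hand it to
// the Replicated State Machine. Names are illustrative, not Aeron's API.
final class CommitIndexCalculator
{
    // matchIndex[i] = highest log index known to be appended on node i,
    // including the leader's own log. The quorum commit index is the
    // highest index replicated on a majority: sort ascending and take
    // the middle element.
    static long quorumCommitIndex(final long[] matchIndex)
    {
        final long[] sorted = matchIndex.clone();
        Arrays.sort(sorted);
        return sorted[(sorted.length - 1) / 2];
    }
}
```

For a three-node cluster with match indices {5, 3, 5}, the quorum commit index is 5: two of three nodes hold entry 5, so it is committed even though one follower lags.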
At any time, a node is in one of three states: Leader, Follower or Candidate. During normal operation, one node is a leader and the remaining nodes are followers.
The leader state machine runs as per the diagram¹ below:
If a follower does not receive messages from the leader before the configured timeout, it starts an election: the node becomes a candidate and sends a message to all other nodes requesting their vote. If the candidate receives votes from a quorum of nodes, it becomes the leader.
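The voting rules behind the election can be sketched as below. This is a simplified rendering of the Raft paper's rules under assumed names, not Aeron Cluster's code: a node grants at most one vote per term, and only to a candidate whose log is at least as up to date as its own.

```java
// Sketch of Raft's voting rules, with illustrative names.
final class VoteTally
{
    // A candidate wins the election once it holds a strict majority.
    static boolean hasQuorum(final int votesReceived, final int clusterSize)
    {
        return votesReceived > clusterSize / 2;
    }

    // A node grants its vote only if it has not already voted this term and
    // the candidate's log is at least as up to date as its own: compare the
    // term of the last entry first, then the log length.
    static boolean grantVote(
        final long myLastTerm,
        final long myLastIndex,
        final long candidateLastTerm,
        final long candidateLastIndex,
        final boolean alreadyVotedThisTerm)
    {
        if (alreadyVotedThisTerm)
        {
            return false;
        }
        return candidateLastTerm > myLastTerm
            || (candidateLastTerm == myLastTerm && candidateLastIndex >= myLastIndex);
    }
}
```

The "at least as up to date" check is what guarantees that a newly elected leader already holds every committed entry, since committed entries exist on a quorum and the candidate needed a quorum of votes.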
When a candidate becomes the leader, it must ensure that all followers have a consistent log. It does this by finding the newest log entry that its log and each follower's log agree on, and deleting any follower entries beyond that point. If a follower node was previously the leader and had committed entries during a network partition (in step 2 above), those entries would not be present on any other node. These entries would be lost.
An application protocol on top of Raft is required to prevent data loss in this scenario.
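The reconciliation step can be sketched as follows. This is an illustrative model, not Aeron Cluster's implementation: logs are represented simply as lists of entry terms, and Raft's log-matching property means that if two logs hold the same term at the same index, they agree on everything before it.

```java
import java.util.List;

// Sketch of log reconciliation after a leadership change: find the newest
// index at which the follower's log agrees with the leader's, then truncate
// the follower's log after that point. Entries present only on the old
// leader are discarded here, which is the data-loss scenario noted above.
final class LogReconciliation
{
    // Logs are lists of entry terms: index i holds the term of entry i.
    // Returns the newest index where both logs hold the same term,
    // or -1 if they share no common prefix.
    static int lastAgreedIndex(final List<Long> leaderTerms, final List<Long> followerTerms)
    {
        int i = Math.min(leaderTerms.size(), followerTerms.size()) - 1;
        while (i >= 0 && !leaderTerms.get(i).equals(followerTerms.get(i)))
        {
            i--;
        }
        return i;
    }

    // The follower deletes every entry after the agreed index; the leader
    // then replicates its own entries to fill the gap.
    static void truncateAfter(final List<Long> followerTerms, final int agreedIndex)
    {
        followerTerms.subList(agreedIndex + 1, followerTerms.size()).clear();
    }
}
```

For example, a leader log with terms [1, 1, 2] and a follower log with terms [1, 1, 3, 3] agree up to index 1, so the follower drops its term-3 entries before accepting the leader's entry at index 2.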
- D. Ongaro and J. Ousterhout, "In Search of an Understandable Consensus Algorithm," p. 18.
- Diagram is from Heidi Howard and Richard Mortier. 2020. Paxos vs Raft: have we reached consensus on distributed consensus? In Proceedings of the 7th Workshop on Principles and Practice of Consistency for Distributed Data (PaPoC '20). Association for Computing Machinery, New York, NY, USA, Article 8, 1–9. DOI: https://doi.org/10.1145/3380787.3393681