Replicated State Machines¶
Replicated state machines are a generalized approach for building fault-tolerant services with distributed systems. If you are familiar with the Gang of Four Design Patterns book, it is worth noting that the State Machine pattern described in the book has nothing to do with the way you would go about implementing a Replicated State Machine.
A Replicated State Machine is a software component that:
- Encodes its states internally
- Accepts commands that address a single method
- Deterministic methods process each command
- Internally, the commands mutate state and/or produce events.
If we have multiple Replicated State Machines that strictly follow the above rules, we can be certain that if they all start in the same initial state and they all process the commands in the same order, then they will have the same output events and the same internal state. The challenge then becomes how to get different machines to agree on the order of commands all while receiving concurrent client requests and in spite of failures. This is the subject of Consensus Algorithms later.
Additional features common to Replicated State Machines:
- Snapshots - this allows a state machine internal state to be captured at a moment in time
- Set - this allows the state machines internal state to be set to a given value.
These two additional features are important for the runtime operational aspects of a Replicated State Machine. If you wanted to capture the state of a state machine without needing to replay every command it had ever received since the initial state, you could take a snapshot of the internal state, save that to some place, and truncate every input command up-to-and-including the snapshot command. Then, when you want the state machine to recover its internal state, you would set the internal state to the value produced by the snapshot.
Sample State Machine¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
For sake of simplicity, let's assume that the
EventListener just logs the current state:
1 2 3 4 5 6 7 8 9
If we then run following the commands:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
We can be certain for every scenario in which we run the set, followed by the multiply, followed by the add then the internal value will be 37. If we changed the order to set, then add, then multiply the internal value would be 72.
Building your clustered service code as a deterministic Replicated State Machine is required for the correct operation of Aeron Cluster. With Aeron Cluster you will define your own commands and events, along with code to build a snapshot and restore from a snapshot.
It is also important to note that the techiques of using a Replicated State Machine can be successfully deployed anytime you have a total ordering of input events and are fully deterministic. Aeron and Aeron Archive provide total ordering capablities within a single channel and stream.
Source code for the above is found in the
theory project in the
- F. B. Schneider, “Implementing fault-tolerant services using the state machine approach: a tutorial,” ACM Comput. Surv., vol. 22, no. 4, pp. 299–319, Dec. 1990, doi: 10.1145/98163.98167