Skip to content

Total Ordering Architectures

Every system in finance built today is a form of distributed system. Most do not need any form of distributed state – which helps keep the complexity, support and engineering costs down and helps increase the delivered system quality. Unfortunately, not all systems delivered in the finance sector can do this safely or cannot do it at all given the constraints forced upon them by system requirements. When this happens, system engineers must resort to using distributed state. Distributed state that is correct under typical operating conditions (including node and network failures) is one of the harder problems in computer science. Making safe, distributed state fast enough for financial systems adds an extra dimension to the challenge.

Total Ordering defined

When we build a system that executes as a single thread on a single computer it will process each input operation sequentially. As a result, the inputs have an implicit predictable order – the input operations have a total order. Taken together, building a system with a single thread on a single computer with total ordering on the inputs means we get a system that is easy to reason about, is likely to be correct, is simpler to engineer and is easier to maintain.

By building a distributed system – or parts thereof – with total ordering, we aim to have multiple processes distributed over multiple computers acting as if they were the same single threaded, single process with all of the benefits plus some additional ones – most notably resilience.

Note:

  • Global totally ordered systems can rarely be scaled out and will only support a limited degree of scale up. Sharding - if possible given the business requirements - is the primary technique to achieve scale out.
  • Total Ordering and Consensus are different concepts. You can achieve Consensus using Total Ordering.
  • In all cases, the ordered messages are submitted to a Replicated State Machine.
  • Total Ordering and Consensus are not be able to protect against a faulty or malicious process. If that's required, you would be better served by another approach.

Approaches to total ordering

Three approaches are common:

  • Atomic Broadcast: A central process is created that observes every inbound message and stamps it with a sequence before any other system component is allowed to process it. Other system components are required to process messages in the order given by the sequence;
  • Distributed Consensus: A formally proven distributed communications protocol such as RAFT delivers total ordering over a high speed/low latency medium; business logic is tied directly into the RAFT commands and receives them totally ordered;
  • External Ordered Queues: When the system requirements allow, input operations may be sharded into multiple independent FIFO queues that individually offer total ordering properties. Each subsequent downstream process must process the operations in the order provided by the FIFO queue.

Atomic Broadcast

There are several different implementations of atomic broadcast out there. One example is using a Sequencer process. Gateways send all messages onto an internal multi-cast network, and all services receive them. If not yet sequenced, services other than the Sequencer will drop the message. The Sequencer will listen for these messages, add a sequence number, and send it back onto the internal multi-cast network. Services will then process the messages in the order set by the sequencer, even if they arrived out of order. If a Service missed a message, there is typically a secondary port (sometimes TCP/IP) that is available for replays. A secondary Sequencer process is kept up-to-date by the primary Sequencer. Should the primary Sequencer fail, the secondary Sequencer takes over.

atomic-broadcast Sequencer Backup Sequencer Service 1 Gateway Service 2 A.1 A.1 A.2 A.3 A.2 A.3 A-A-A A-A-A A-A-A

Distributed Consensus Architecture

Distributed Consensus based architectures give the same effect but with a different approach. A gateway sends a command to the cluster leader node, which replicates it to followers, and only once quorum followers have accepted the command, is the business logic given sight of the message. Some systems keep the business logic internal to the cluster, while others use the cluster as high-availability sequencer. In either case, multi-cast can be used to increase performance, but this is optional – which makes this architecture better suited for cloud environments.

consensus Cluster Node Follower Cluster Node Follower Cluster Node Leader Gateway A.1-A.2-A.3 A.1-A.2-A.3 A-A-A A.1-A.2-A.3

External Ordered Queues

Externally Ordered Queue based architectures are another approach to achieving total ordering. They rely on an external service such as Kafka to provide total ordering and replay. Communication between services is typically TCP/IP based. These systems tend to have high latency along with high throughput, and like the Sequencer approach, can allow for easier scale out of services.

kafka Service 1 Service 2 Kafka or similar Gateway Queue A-A-A A.1 A.2 A.3 A.1 A.2 A.3

Approaches compared

Each of these approaches has their own pros and cons:

Approach Pros Cons
Atomic Broadcast Lowest latency; systems with this architecture can have 99th percentile latencies <10μs using specialized hardware and dedicated servers Slow recovery – loss of more than a handful of messages can result in lengthy replays from slower media such as disk
Easy to scale out as each service can be its own process Challenging to engineer
High expense
Typically, only run on dedicated hardware
Distributed Consensus Tunable resiliency – operational process to decide on how many node failures the system can accept Cloud friendly
Increased network hops result in higher latency. This architecture can achieve latencies 99th percentile latencies of 30-40μs using specialized hardware and dedicated servers.
Cluster hosting the business logic is ultimately the bottleneck.
External Ordered Queues High degree of sharding means the entire system can likely achieve a higher throughput Latency reaches tens to hundreds of milliseconds when using hardware (for example Solace) or software queues (for example Kafka)