Design Decisions

Aeron Cluster hosts a Replicated State Machine. For Aeron Cluster to operate correctly, your code must strictly follow the rules of Replicated State Machines:

  • State must be held internally. There can be no external I/O operations, even to seed static data
  • The state machine must accept commands that address a single method
  • Deterministic methods process each command. This means no unseeded random number generators and careful usage of data structures
  • Internally, the commands can mutate state and/or produce events.
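The rules above can be illustrated with a minimal sketch. This is not part of the RFQ sample: the class AccountStateMachine, its methods, and the event strings are hypothetical names chosen for illustration. It assumes a single command type that maps to a single method, keeps all state in memory, and records the events it produces.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example, not taken from the RFQ sample.
final class AccountStateMachine
{
    // All state is held internally. LinkedHashMap gives a predictable iteration order.
    private final Map<String, Long> balances = new LinkedHashMap<>();

    // Events produced while processing commands, in the order they were produced.
    private final List<String> events = new ArrayList<>();

    // One command addresses one method. The outcome depends only on the
    // command arguments and the current internal state.
    void onDeposit(final String account, final long amount)
    {
        if (amount <= 0)
        {
            // Produce an event without mutating state.
            events.add("REJECTED:" + account);
            return;
        }

        // Mutate state and produce an event.
        final long newBalance = balances.merge(account, amount, Long::sum);
        events.add("DEPOSITED:" + account + ":" + newBalance);
    }

    long balanceOf(final String account)
    {
        return balances.getOrDefault(account, 0L);
    }

    List<String> events()
    {
        return events;
    }
}
```

Because the method uses no clocks, random numbers, or external I/O, two instances fed the same commands in the same order end up with the same state and the same events, which is the property replication relies on.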

In addition to the above, Aeron Cluster latency and throughput are directly affected by the time your code takes to execute.

Some principles to keep in mind:

  • The less the state machine has to do, the faster it can operate. This includes:
    • Don't manage or hold state that is not necessary for state machine operation. Operational concerns such as monitoring and logging may impact this decision as well.
    • Consider where code has to run. For example, if you need to customize an event for many recipients, is it better to have the code run in the state machine, or on a gateway?
  • Data structure selection can have a huge effect on performance
  • In the JVM, garbage collection can severely impact latency profiles, especially at the 99th percentile and above. Build your code to avoid garbage collection, for example by not allocating on the hot path, to reduce the spread between average and max latencies.
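As one illustration of the data structure and garbage collection points above, the sketch below shows a fixed-capacity queue of primitive long values. The class LongRingBuffer is a hypothetical example, not something used by the RFQ sample. It assumes the capacity is known up front, so the only allocation happens in the constructor, and it avoids the boxing that a queue of Long objects would involve.

```java
// Hypothetical example, not taken from the RFQ sample.
final class LongRingBuffer
{
    private final long[] values; // allocated once, reused for the lifetime of the buffer
    private int head;            // index of the oldest element
    private int size;            // number of elements currently stored

    LongRingBuffer(final int capacity)
    {
        values = new long[capacity];
    }

    // Adds a value at the tail. Returns false instead of growing when full.
    boolean offer(final long value)
    {
        if (size == values.length)
        {
            return false;
        }
        values[(head + size) % values.length] = value;
        size++;
        return true;
    }

    // Removes and returns the oldest value.
    long poll()
    {
        if (size == 0)
        {
            throw new IllegalStateException("buffer is empty");
        }
        final long value = values[head];
        head = (head + 1) % values.length;
        size--;
        return value;
    }

    int size()
    {
        return size;
    }
}
```

The trade-off is a hard capacity limit: the caller has to decide what to do when offer returns false, rather than letting the structure grow and allocate.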

And a note on Commands versus Events:

  • Commands are inputs to the Replicated State Machine. They are sent by Aeron Cluster gateways, which connect via the Aeron Cluster Client.
  • Events are outputs of the Replicated State Machine. They can be broadcast to all connected gateways or sent to a specific gateway.

Aeron Cluster vs Three Tiered

In a typical three-tier system, the data and business logic are independent. The golden source lives within the database engine, and the business logic is as stateless as it can be. By removing the state from the server layer, we can add more server instances to meet added demand. Each server is typically multithreaded, and developers must take care with what they do within the business logic layer. As the user interface makes server requests, any of the available servers can process the request safely. The database cluster’s responsibility is to keep data consistent, safe, and resilient. If any ordering of inbound data is required, this must be done within the database.

[Figure: three-tier architecture. A User Interface and an API Client send JSON to several Server (Business Logic) instances, which use a DB-specific protocol to reach the Database Cluster (Data).]

This differs from a typical Aeron Cluster application. Aeron Cluster applications put the business logic in the same process as the data. This can include data sourced from external systems, or data that has the cluster as the golden source. Databases are entirely optional, and many implementations do not use a database at all. If they do, the database activity is completely outside the hot path of the cluster. All interactions with the cluster go through Gateway processes. On one side, a gateway communicates with Aeron Cluster over Aeron using a user-defined protocol; on the other side, it communicates with some external process such as a User Interface, API client or database cluster.

[Figure: Aeron Cluster architecture. A Cluster Leader and two Cluster Followers each host the State Machine. An API Gateway, a UI Gateway, and a Database Gateway talk to the cluster using a proprietary protocol over Aeron. On their other side, the gateways serve the User Interface, the API Client (JSON or other formats), and an optional Database Cluster (DB specific).]

Decisions for the RFQ Cluster

Performance can be relatively relaxed

Given the sample is for a Request For Quote model, performance is not of utmost importance.

SBE will be used for the protocol, with simple Java objects for the domain model

The sample will be using a protocol based upon Simple Binary Encoding, and simple plain old Java objects in the domain model.

Logic in the Gateways

To keep cluster logic simple, Gateway processes are assumed to include the logic that customizes outbound messages per user, as needed. See Gateway Distribution for more on this topic. The general principle is that the cluster broadcasts state changes, decorated with enough data for gateways to route them to specific users, and replies directly to the originating client with a confirmation or an error.

...
final Rfq rfq = new Rfq(++rfqId, correlation, expireTimeMs, quantity, side, cusip, userId);
rfqs.add(rfq);
LOGGER.info("Created RFQ {}", rfq);

//send a confirmation to the client that created the RFQ
clusterClientResponder.createRfqConfirm(correlation, rfq, CreateRfqResult.SUCCESS);

//broadcast the new RFQ to all clients
clusterClientResponder.broadcastNewRfq(rfq);

//schedule the RFQ to expire
timerManager.scheduleTimer(rfq.getExpireTimeMs(), () -> expireRfq(rfq.getRfqId()));
...
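On the gateway side, per-user customization of a broadcast could look something like the sketch below. It is an illustration under stated assumptions, not the sample's gateway code: NewRfqEvent, GatewayRouter, and the message strings are hypothetical, and the sketch assumes the broadcast event carries the requester's user id so the gateway can tell the requester apart from other users.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical decoded form of a broadcast event; the field names are assumptions.
final class NewRfqEvent
{
    final int rfqId;
    final int requesterUserId;
    final String cusip;

    NewRfqEvent(final int rfqId, final int requesterUserId, final String cusip)
    {
        this.rfqId = rfqId;
        this.requesterUserId = requesterUserId;
        this.cusip = cusip;
    }
}

// Hypothetical gateway component that turns one broadcast into per-user messages.
final class GatewayRouter
{
    // User ids with a session on this gateway.
    private final List<Integer> connectedUsers = new ArrayList<>();

    void connect(final int userId)
    {
        connectedUsers.add(userId);
    }

    // The cluster sends the event once; the gateway decides what each user sees.
    List<String> onNewRfq(final NewRfqEvent event)
    {
        final List<String> outbound = new ArrayList<>();
        for (final int userId : connectedUsers)
        {
            final String kind = userId == event.requesterUserId ? "YOUR_RFQ" : "NEW_RFQ";
            outbound.add(userId + ":" + kind + ":" + event.rfqId + ":" + event.cusip);
        }
        return outbound;
    }
}
```

Keeping this loop in the gateway means the state machine does a fixed amount of work per RFQ regardless of how many users are connected.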

Users are hard coded

The cluster design assumes that users are known to the cluster but managed within the gateways. The cluster has a set of hardcoded user IDs (500, 501, 502). In a real-world system, user data would not be hardcoded; it could be managed either entirely in the gateways or within the cluster.
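A hardcoded user check of this kind might be sketched as follows. Only the three user ids come from the text above; the class KnownUsers and its method are hypothetical, and the sample's actual implementation may differ.

```java
// Hypothetical example; only the ids 500, 501 and 502 come from the text.
final class KnownUsers
{
    // Fixed set of user ids the cluster accepts.
    private static final int[] USER_IDS = {500, 501, 502};

    private KnownUsers()
    {
    }

    // Deterministic lookup with no external I/O, so it is safe inside the state machine.
    static boolean isKnown(final int userId)
    {
        for (final int known : USER_IDS)
        {
            if (known == userId)
            {
                return true;
            }
        }
        return false;
    }
}
```

A command handler could reject a request early when isKnown returns false, before any state is mutated.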