Efficient Business Logic¶
Correctly and efficiently adding Business Logic to Aeron Cluster can be challenging, but can be simpler when the following rules are applied:
- Consider the cluster's bounded context
- Select the cluster codec carefully
- Implement your business logic using the best suited data structures and techniques
- Push validation to the Gateways
- Externalise long running tasks
- Use timers to break apart long running tasks that cannot be externalised
- Keep the hot path clean
- Shard processing
A number of these rules relate to a battle against Little's Law - see Performance Limits.
There is an additional, unbreakable rule: whatever business logic you write MUST be deterministic. Without deterministic business logic, you are likely to suffer data loss in the event of a leader failure.
Consider the cluster's bounded context¶
The Bounded Context pattern from Domain Driven Design is a good way to consider the responsibilities and logic held within the cluster. Focus the logic in the cluster on core system state - there is little value in processing stateless logic in the cluster.
If you're finding that you need multiple bounded contexts (for example, matching and settlement functionality), consider breaking apart the cluster into two independent clusters. This introduces new complexities, but can result in higher performance and easier development and maintenance. With two clusters, you would require a cluster client gateway to join the two clusters - taking into consideration protocol versioning and deployment complications.
Select the cluster codec carefully¶
Aeron Cluster operates with an onion like architecture, typically with 4 layers:
- FIX, JSON, Protobuf, Avro, or whatever format required is used for external connectivity
- External connectivity is made via Gateway processes, which are cluster clients
- Cluster clients make use of the cluster's internal codec to speak to the Cluster
- And in the center, is the cluster itself.
Aeron Cluster business logic should never interact directly with any I/O unless via a cluster client. As always there are exceptions to this rule (logging, metrics, observability for example), but this is a good baseline.
The internal cluster codec should be designed and implemented in such a way that it takes minimal processing time within the business logic. Using Simple Binary Encoding or similar, this time can be kept to single or double digit nanoseconds.
If you were to use something like JSON here, even the most efficient libraries still parse messages in hundreds of microseconds. If we imagine that parsing a cluster command sent in JSON were to take 600μs, then the cluster throughput is limited to ±1,667 requests/second. If we spent 50ns on this task, cluster throughput is limited to 2,000,000 requests/second. Find a calculator for this here.
Implement business logic using the best suited data structures and techniques¶
As seen with the careful selection of cluster codecs, the time spent processing a command has a large impact on the maximum cluster throughput.
If you've selected a codec like Simple Binary Encoding, you'll want to carefully consider how your business logic operates internally. You'll need to ask questions such as:
- which data structures work best for the logic, given the data volumes? For example, do you use Maps, TreeMaps, Arrays etc?
- what data types are allowed in the cluster? For example, Strings can lead to high amounts of GC. Do you need to use them?
- do you want to pre-allocate everything upfront and work off of DirectBuffers you manage yourself? Or do you put your trust in the Java virtual machine's GC?
- how are you logging? How does your logging library impact performance? And can it lead to large GC pauses?
Push validation to the edges¶
By pushing validation to the edges of the system - both in terms of ensuring that commands are valid as well as the cluster bootstrap data (see Databases), your cluster logic can be simpler, and likely operate faster. Validation must always happen first, since rolling back the processing of a message can be highly complex - for example, if you've already submitted a trade for booking, but then the logic realises that a credit limit was breached, you now have to find a way to cancel the trade. Validating first removes the need for dealing with complex rollback scenarios.
Externalise long running tasks¶
Sometimes we require complex, very expensive computations to be performed on data within a cluster. This should be externalised where possible so that the cluster is protected, and throughput is not impacted.
Techniques for this vary by the nature of the computation, but a common approach is to prepare a snapshot of all state required for the computation and send it as a single work packet to a specialist gateway. After preparing the snapshot, version and/or lock the relevant state so that the business logic can know if the result can be safely accepted and applied on return from the gateway. Use a cluster timer to monitor process of the external task - for example, set a timer to expire in 500ms, and if the result has not been received yet, retry. Once the results are returned, cancel any relevant timers and apply the changes to the cluster state if the version and/or lock allow.
Use Timers or messages to self to break apart long running tasks¶
Sometimes tasks have to be performed within the cluster, and cannot safely be sent externally. In this scenario, instead of processing the command as a single long running task, break it down. While a long running task is being performed, the cluster cannot process commands and latency can spike.
Store the required state regarding the long process in the cluster, and schedule the task in chunks via the cluster timer. This allows processing of other commands to continue, and can lower the impact on latency. It's best to schedule the next sub-task after the completion of the last processed sub-task, which can either be done via a timer or through sending a message to the cluster.
Keep the hot path clean¶
When we're creating new side effects in Aeron Cluster, keep the hot path clean of any synchronous tasks. For example if you're booking a trade and want to store it in the database, do not hold the cluster for a database write. Rather send the write to a gateway which can write to the database. Trust the cluster to retain your state - this requires that the unbreakable rule of deterministic business logic is held.
If volumes and latency require it, processing can be sharded. Doing so requires careful consideration as to what you are sharding on - for example in a financial exchange, do you shard on instrument, on groups of people, on a location or via something else? This decision is specific to each implementation, but is a tactic that can be employed when needed. Consideration needs to be put in around Gateway design and the impact on external users and systems - for example, are the clusters fully independent, or do gateways speak across clusters?
For more on sharding, see On Sharding