On Sharding

When building some classes of systems such as an OMS or an Exchange using Aeron Cluster, there is often a tension around needing to shard as volumes increase while also facing the impossibility to shard while keeping correctness guarantees.

Let's take the example of an exchange design that offers per symbol trading such as via a central limit order book and has strong credit limit requirements. Sharding on a per symbol basis is an obvious candidate to scale out the system since the interactions are with each symbol's central limit order book. Sharding by symbol is prevented though by the strong credit limit requirement. This requirement is a cross-cutting concern - the credit limits are observed for all interactions that a user or legal entity has, regardless of what symbols they interact with.

There are several mechanisms available to address situations like this:

  • don't shard, and invest heavily in performance of the cluster code. This option is by far the simplest and offers the lowest latency and highest availability profile. At some point though, the cluster will hit a performance limit.
  • consider loosening the rules around the credit matrix correctness; this must be a business driven decision specific to your wider architecture. Example approaches could include:
    • move the credit data to the gateways and accept the risk of periods of excess credit utilization
    • apply dynamic risk based rules with some form of inter-shard communication, and accept the risk of periods of excess credit utilization
  • shard, and extract the credit matrix into an external process; this could be constructed in several ways, depending on availability requirements. Some examples:
    • an extra cluster could be constructed just focused on credit matrices; all shards have a cluster client that acts as a cluster client to the credit matrix cluster. This would potentially double latency but would offer a high resilience profile
    • a shared cluster client could be constructed which is a client to all the shards and processes all credit matrix decisions; this would have lower latency but will offer minimal resilience to failure.
  • shard the Aeron Cluster itself into multiple clustered services. This allows you to run multiple services off of the same inbound Raft log. Each service receives the same message, and must decide for itself if the message should be consumed or ignored. Services can also communicate between each other (this is done using the same mechanism as the Alternative to Cluster Timers).
    • this approach offers lower latency than a fully independent cluster, however, care must be taken to ensure that the Raft log does not become the bottleneck
    • be aware that the multiple clustered services approach introduces some complex failure scenarios - for example, what happens if one of the clustered services fails? What happens if one is running slow? etc.

Other trading system architectures based upon Virtual Synchrony or Atomic Broadcast can offer a lower overall latency with additional services (such as centralized credit limits) and at least primary-backup resilience. These designs have a requirement for multicast and specialist hardware - this tends to exclude the design from cloud environments and non-specialist development teams.