High Availability

Open Source Capabilities

Aeron Cluster is commonly deployed in configurations of 3 to 5 nodes, with all nodes situated in close proximity, such as within the same data center or the same cloud availability zone. Keeping the nodes close together minimizes the network latency required to achieve quorum when committing log entries, keeping commit latency, and therefore back pressure, to a minimum.

The drawback of this approach is that it concentrates the risk of failure within the same availability zone or data center. Although it is possible to extend the cluster by adding a node in a remote data center, this may result in more frequent cluster back pressure events. The increased physical distance and associated network latency can hinder the leader's ability to commit new log entries efficiently.
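The quorum arithmetic behind these sizing choices can be made concrete. A log entry is committed once a majority of nodes, leader included, have appended it. The helper below is a hypothetical illustration (not part of the Aeron API) of why odd cluster sizes such as 3 and 5 are the common choices:

```java
// Majority quorum: an entry is committed once more than half the nodes
// (leader included) have appended it. Hypothetical helper, not Aeron API.
public final class Quorum
{
    public static int size(final int clusterSize)
    {
        return clusterSize / 2 + 1;
    }

    public static int tolerableFailures(final int clusterSize)
    {
        return clusterSize - size(clusterSize);
    }

    public static void main(final String[] args)
    {
        // 3 nodes: quorum of 2, tolerates 1 failure.
        // 4 nodes: quorum of 3, still tolerates only 1 failure,
        //          so an even size adds cost without added fault tolerance.
        // 5 nodes: quorum of 3, tolerates 2 failures.
        for (final int n : new int[] { 3, 4, 5 })
        {
            System.out.println(
                n + " nodes -> quorum " + size(n)
                + ", tolerates " + tolerableFailures(n) + " failure(s)");
        }
    }
}
```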

Cluster Backup

Aeron Cluster offers built-in Cluster Backup capabilities. This allows an active cluster's log to be replicated to a remote location, from which the cluster state can be rebuilt via log replay should a complete data center loss occur. Recovery time from a cluster backup is driven by a number of factors, including the age of the most recent snapshot, the number of log entries to replay, and the length of time it takes to replay them.
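Those factors can be combined into a back-of-envelope estimate: recovery time is roughly the time to load the latest snapshot plus the time to replay the log tail recorded since it was taken. The numbers below are illustrative assumptions, not Aeron measurements:

```java
// Back-of-envelope model for cluster-backup recovery time.
// All inputs are illustrative assumptions, not Aeron measurements.
public final class RecoveryEstimate
{
    public static double estimateSeconds(
        final double snapshotLoadSeconds,
        final long entriesSinceSnapshot,
        final double replayEntriesPerSecond)
    {
        // Load the most recent snapshot, then replay the remaining log tail.
        return snapshotLoadSeconds + entriesSinceSnapshot / replayEntriesPerSecond;
    }

    public static void main(final String[] args)
    {
        // e.g. a 5s snapshot load plus 1M entries replayed at 100k entries/s.
        System.out.println(estimateSeconds(5.0, 1_000_000L, 100_000.0) + "s");
    }
}
```

Frequent snapshots shrink the log tail and therefore the replay time, at the cost of taking snapshots more often.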

A minimal sample snippet of Cluster Backup:

public static void main(final String[] args)
{
    final ShutdownSignalBarrier barrier = new ShutdownSignalBarrier();
    final String rootDir = System.getProperty("user.dir");
    final Path clusterBackupPath = Path.of(rootDir, "cluster-node");
    IoUtil.delete(clusterBackupPath.toFile(), true);

    // local process setup
    final AeronArchive.Context localArchiveContext = new AeronArchive.Context()
        .controlRequestChannel(CommonContext.IPC_CHANNEL)
        .controlResponseChannel(CommonContext.IPC_CHANNEL);

    // remote connectivity
    final AeronArchive.Context clusterArchiveContext = new AeronArchive.Context()
        .controlRequestChannel("aeron:udp?endpoint=cluster:10101")
        .controlRequestStreamId(10)
        .controlResponseChannel("aeron:udp?endpoint=cluster:10102")
        .controlResponseStreamId(20);
    final ClusterBackup.Context ctx = new ClusterBackup.Context()
        .errorHandler(Throwable::printStackTrace)
        .deleteDirOnStart(true)
        .archiveContext(localArchiveContext)
        .clusterArchiveContext(clusterArchiveContext)
        .clusterDirectoryName(clusterBackupPath.resolve("cluster").toString())
        .clusterConsensusEndpoints("cluster:10106")
        .consensusChannel("aeron:udp?endpoint=cluster:10006")
        .consensusStreamId(108)
        .catchupChannel("aeron:udp?endpoint=cluster:10005")
        .logStreamId(ConsensusModule.Configuration.logStreamId());

    // Use an instance of this to monitor backup progress and, if required,
    // exit at a set log position.
    ctx.eventsListener(new ClusterBackupEventsListener(barrier));

    try (
        final ArchivingMediaDriver mediaDriver = ArchivingMediaDriver.launch(
            new MediaDriver.Context().dirDeleteOnStart(true),
            new Archive.Context()
                .archiveDirectoryName(clusterBackupPath.resolve("archiver").toString())
                .deleteArchiveOnStart(true)
        );
        final Aeron aeron = Aeron.connect();
        final ClusterBackup clusterBackup = ClusterBackup.launch(ctx)
    )
    {
        barrier.await();
        System.out.println("Shutting down...");
    }
}
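The ClusterBackupEventsListener passed to ctx.eventsListener(...) above is application code; in the quickstart it implements Aeron's ClusterBackup.EventsListener interface and signals the ShutdownSignalBarrier once the backup has progressed far enough. The stand-in below sketches that pattern without the Aeron dependency; the interface and method name are illustrative assumptions, not the real Aeron API:

```java
import java.util.concurrent.CountDownLatch;

// Stand-in for the progress callback wired up via ctx.eventsListener(...).
// The interface and method name are illustrative assumptions; the real
// interface is io.aeron.cluster.ClusterBackup.EventsListener.
interface BackupProgressListener
{
    void onLiveLogProgress(long logPosition);
}

// Signals a latch (standing in for Agrona's ShutdownSignalBarrier) once the
// backup has replicated the log up to a target position.
final class StopAtPositionListener implements BackupProgressListener
{
    private final CountDownLatch shutdownLatch;
    private final long targetPosition;

    StopAtPositionListener(final CountDownLatch shutdownLatch, final long targetPosition)
    {
        this.shutdownLatch = shutdownLatch;
        this.targetPosition = targetPosition;
    }

    public void onLiveLogProgress(final long logPosition)
    {
        if (logPosition >= targetPosition)
        {
            shutdownLatch.countDown(); // unblocks the main thread's await()
        }
    }
}
```

In the real snippet the latch role is played by the ShutdownSignalBarrier, so barrier.await() in main returns once the listener decides the backup has caught up.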

The quickstart includes a full cluster backup example.

Premium Capabilities

In addition to its core open-source features, Aeron Cluster offers premium modules that enhance performance, security, and high availability, further extending its capabilities for use in the most demanding environments.

Cluster Standby

Aeron Premium introduces Cluster Standby capabilities, facilitating expanded deployments across data centers without necessitating the replay of extensive cluster logs or causing back pressure on the cluster leader. Like an active node, each standby cluster node processes every message, ensuring internal state consistency with the primary cluster as quickly as the logic and network allow.

In the event of a complete data center loss, this approach enables rapid recovery. A flag simply needs to be set (typically by a human operator), allowing the surviving standby nodes to be switched into active mode. Data loss is limited to information that was in transit and not yet committed on the standby nodes at the time of the failure.

Cluster Standby deployments provide significant flexibility, allowing configurations that balance bandwidth costs and resilience. By routing all traffic through a single node, bandwidth costs can be minimized, while having all nodes receive the same data increases resilience. Configurations can be customized to achieve the optimal balance between these objectives.
