Cluster Errors

Aeron Cluster Exceptions

Aeron Cluster Exceptions (io.aeron.cluster.client.ClusterException) are raised with different categories:

  • WARNING, which suggests something went wrong that Aeron Cluster will usually (but not always) recover from;
  • ERROR, which indicates a serious problem that requires corrective action or a process restart;
  • FATAL, which is terminal. A process restart is required to recover.

Some common errors are described below.

Note

It helps to have read and understood the Raft Consensus Algorithm paper, since the descriptions below rely on its concepts (leader, follower, election, quorum, cluster log, etc.).

Leader heartbeat timeout

This WARNING is found on follower nodes, and happens when the leader has been out of contact for a period of time longer than the leader heartbeat timeout.

There can be many reasons for this, including:

  • the leader node has failed
  • the leader node has been cut off from this follower node via a network partition

After raising this warning, the follower initiates an election.
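The timeout condition described above can be sketched as a simple elapsed-time comparison. This is an illustrative model only, not Aeron's internal code; the class, method, and parameter names are hypothetical.

```java
import java.util.concurrent.TimeUnit;

// Illustrative model of a follower deciding that the leader has timed out.
// All names here are hypothetical and are not part of the Aeron API.
final class HeartbeatTimeoutSketch
{
    // Returns true when the leader has been out of contact for longer than the timeout.
    static boolean hasLeaderTimedOut(
        final long nowNs, final long lastLeaderContactNs, final long timeoutNs)
    {
        return (nowNs - lastLeaderContactNs) > timeoutNs;
    }

    public static void main(final String[] args)
    {
        final long timeoutNs = TimeUnit.SECONDS.toNanos(10); // assumed example value
        final long lastContactNs = 0L;

        // 4 seconds of silence: still within the timeout
        System.out.println(hasLeaderTimedOut(TimeUnit.SECONDS.toNanos(4), lastContactNs, timeoutNs));
        // 11 seconds of silence: the follower would raise the warning and start an election
        System.out.println(hasLeaderTimedOut(TimeUnit.SECONDS.toNanos(11), lastContactNs, timeoutNs));
    }
}
```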

The leader heartbeat timeout can be controlled via the aeron.cluster.leader.heartbeat.timeout system property, or by setting it on the cluster context with ConsensusModule.Context.leaderHeartbeatTimeoutNs(long value).
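A minimal sketch of the system property route, using only the JDK. The helper class and the 5 second value are illustrative assumptions. The "5s" form assumes Aeron's duration properties accept a unit suffix; check this against the Aeron version in use.

```java
// Illustrative helper; the class and method names are hypothetical.
final class LeaderHeartbeatTimeoutConfig
{
    // Property name taken from the text above.
    static final String TIMEOUT_PROP = "aeron.cluster.leader.heartbeat.timeout";

    // Sets the timeout as a duration string such as "5s".
    // Assumption: the property is read when the cluster context is created,
    // so this should run before any ConsensusModule.Context is constructed.
    static void configure(final long timeoutSeconds)
    {
        if (timeoutSeconds <= 0)
        {
            throw new IllegalArgumentException("timeout must be positive: " + timeoutSeconds);
        }
        System.setProperty(TIMEOUT_PROP, timeoutSeconds + "s");
    }

    public static void main(final String[] args)
    {
        configure(5);
        System.out.println(TIMEOUT_PROP + "=" + System.getProperty(TIMEOUT_PROP));
        // Alternatively, pass the value in nanoseconds to the equivalent
        // setter on ConsensusModule.Context instead of using the property.
    }
}
```

The same value can be supplied on the command line with -Daeron.cluster.leader.heartbeat.timeout=5s, which avoids ordering concerns inside the application.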

No catchup progress

This WARNING happens when the commit position in the cluster log does not advance for a period of time longer than the leader heartbeat timeout. You can observe the position in near real-time via AeronStat as the Cluster commit-pos counter. See Understanding Cluster Counters.

Aeron Cluster processes messages in batches (100 by default, although this can be changed). If processing a batch takes longer than the leader heartbeat timeout, this exception is raised. Slow processing can have many causes, including slow application code, heavy machine load, and extended garbage collection pauses.

As with the Leader heartbeat timeout, the timeout value can be controlled via the aeron.cluster.leader.heartbeat.timeout system property, or by setting it on the cluster context with ConsensusModule.Context.leaderHeartbeatTimeoutNs(long value).

Inactive follower quorum

This WARNING happens when the leader no longer has an active quorum of cluster members, and causes an election to be started.

The number of nodes required for quorum varies based on the size of the cluster. For example, a 3 node cluster requires 2 nodes for quorum.
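The quorum size is a strict majority of the cluster members. A minimal sketch of that arithmetic, assuming the usual Raft majority rule; the class and method names are illustrative and not part of the Aeron API.

```java
// Illustrative quorum arithmetic; names are hypothetical.
final class QuorumSketch
{
    // Smallest number of members that forms a strict majority.
    static int quorumThreshold(final int memberCount)
    {
        if (memberCount < 1)
        {
            throw new IllegalArgumentException("memberCount must be >= 1: " + memberCount);
        }
        return (memberCount / 2) + 1;
    }

    // Number of members that can be lost while a quorum is still available.
    static int tolerableFailures(final int memberCount)
    {
        return memberCount - quorumThreshold(memberCount);
    }

    public static void main(final String[] args)
    {
        for (final int n : new int[] {1, 3, 4, 5})
        {
            System.out.println(
                n + " nodes: quorum=" + quorumThreshold(n) + ", tolerates=" + tolerableFailures(n));
        }
    }
}
```

Note that a 4 node cluster needs 3 nodes for quorum and so tolerates only one failure, the same as a 3 node cluster. This is why odd cluster sizes are the common choice.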

There can be many reasons for this, including:

  • follower nodes have failed
  • the leader process has been cut off from follower nodes via a network partition

Unexpected vote request

This WARNING happens when a node receives a request to vote for a candidate term that's newer than the current term. The vote is unexpected because the node was not in an election at the time. The node receiving this will enter into an election and process the vote.

There can be many reasons for unexpected votes, including:

  • the leader had a heartbeat timeout or found itself with an inactive follower quorum (see above)
  • a follower started an election after losing contact with the leader, and this node did not lose contact or had not yet timed out

Unexpected new leadership term

This WARNING happens when a node receives a New Leadership Term message, and the new leadership term is newer than the leadership term the node is currently aware of. This will cause the node to enter into an election.
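Both this warning and the unexpected vote request come down to a term comparison: the node sees a term newer than the one it knows about while it is not in an election. A simplified sketch of that check follows. It is not Aeron's implementation, all names are hypothetical, and the handling of equal or older terms is deliberately left out because this page does not describe it.

```java
// Simplified model of the "newer term observed" check; names are hypothetical.
final class TermCheckSketch
{
    enum Action
    {
        ENTER_ELECTION, // a newer term was observed, so the warning is raised
        NO_WARNING      // the term is not newer; real handling is out of scope here
    }

    // currentTermId:  the leadership term this node is currently aware of
    // observedTermId: the term carried by the vote request or New Leadership Term message
    static Action onTermObserved(final long currentTermId, final long observedTermId)
    {
        return observedTermId > currentTermId ? Action.ENTER_ELECTION : Action.NO_WARNING;
    }

    public static void main(final String[] args)
    {
        System.out.println(onTermObserved(7, 8)); // newer term
        System.out.println(onTermObserved(7, 7)); // same term
    }
}
```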

Cluster Termination Exception

Aeron Cluster Termination Exceptions (from package io.aeron.cluster.service, class ClusterTerminationException) are raised during abort or termination situations. In normal scenarios you should not expect to encounter these exceptions, but they could be thrown in some unexpected termination scenarios, including:

  • incompatible timestamp units in a new leadership term
  • incompatible protocol versions in a new leadership term
  • cluster storage space is exhausted for an underlying Aeron Archive (for example, log or snapshot)
  • if Aeron Cluster encounters a closed Aeron client during operations in the Consensus Module
  • if the Consensus Module archive subscription is not connected

ClusterTerminationException is a subclass of AgentTerminationException, so throwing it terminates the Agrona Agent (such as the cluster's ConsensusModule) that is running the code.