Liveness Detection

Before discussing liveness detection, we first need to define liveness and liveness detectors:

  • Liveness is a property of a distributed system: a guarantee that something good will eventually happen. For example, a request will eventually be processed, a response will eventually be returned, or a failed node will eventually be detected.
  • Liveness detectors are components within a distributed system that aim to ensure liveness by continuously monitoring the nodes or processes in a distributed system.

Liveness detectors that span threads or processes are typically imperfect and can produce incorrect results. Not only can a single detector produce incorrect results, but different detectors can also report inconsistent results about the same monitored process. Three examples are shown below:

  • If A sends a heartbeat to B, but B responds after the timeout, then B has not failed; it merely failed to respond within the required timeout period. A will nevertheless suspect that B has failed.
  • If A sends a heartbeat to B, and B crashes moments after responding that it is alive, then B has failed but appears (for some time at least) to be alive.
  • You can also get mixed results. If A sends a heartbeat to B, but B never receives it because of some failure, while C sends a heartbeat to B and B does receive it, then A and C have divergent views on the current status of B. Who is correct?
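
The third scenario can be sketched in a few lines. This is a minimal illustration, not part of Aeron: the `TimeoutObserver` class, its method names, and the one-second timeout are all assumptions made for the example, and time is passed in explicitly so the outcome is deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TimeoutObserver:
    """Hypothetical observer that judges peers by a fixed response timeout."""
    name: str
    timeout_s: float
    last_response_s: Dict[str, float] = field(default_factory=dict)

    def on_response(self, peer: str, now_s: float) -> None:
        # Record the time at which a heartbeat response from the peer arrived.
        self.last_response_s[peer] = now_s

    def considers_alive(self, peer: str, now_s: float) -> bool:
        # A peer is considered alive only if it responded within the timeout.
        last = self.last_response_s.get(peer)
        return last is not None and (now_s - last) <= self.timeout_s


def divergent_views():
    a = TimeoutObserver("A", timeout_s=1.0)
    c = TimeoutObserver("C", timeout_s=1.0)

    # t=0.0: both A and C receive a heartbeat response from B.
    a.on_response("B", 0.0)
    c.on_response("B", 0.0)

    # t=1.0: C's heartbeat round trip succeeds, but A's heartbeat is lost,
    # so only C records a fresh response.
    c.on_response("B", 1.0)

    # t=1.5: A's last response is 1.5 s old, C's is 0.5 s old.
    return a.considers_alive("B", 1.5), c.considers_alive("B", 1.5)


print(divergent_views())  # (False, True)
```

Both observers apply the same rule to the same process, yet A suspects B while C does not. Neither is wrong given what it has seen, which is why the application has to decide how such disagreements are resolved.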

As can be seen, understanding the status of a remote process is a complex problem, and the solution is typically very specific to a system's topology and protocol design. As a result, detecting liveness is an application-level concern in systems based on Aeron Transport or Aeron Archive. Aeron’s connection semantics (such as a publication's isConnected) are inappropriate for use as a liveness detector.

A range of well-known liveness detectors can be adopted at the application layer. It is up to the application team to define their requirements and then match those requirements to an existing liveness detector or define a custom one that meets them. One strategy may be suitable within a single process, while a different strategy may be more applicable across processes and nodes.

A sample of liveness detection approaches:

  • Heartbeats. The simplest approach and the easiest to reason about. This is used within Aeron itself, for example, between the Aeron client and Media Driver. Nodes or threads will typically notify other nodes or threads within the system that they are alive by sending a heartbeat. If the recipient does not receive the heartbeat within the predefined timeout, then it can make a reasonable assumption that the sender has failed. This approach is best suited for smaller numbers of threads and nodes.
  • Gossip Protocols. In this model, nodes will typically send a heartbeat to a subset of all the other system nodes, and the liveness status emerges over time as the status propagates through the system. You can combine liveness detection with membership as well, with protocols like SWIM with Lifeguard.
  • Failure Detectors, for example, the Phi Accrual Failure Detector. A Phi Accrual Failure Detector performs statistical analysis on the deviation of current heartbeat inter-arrival times versus the mean inter-arrival times in order to determine the probability (or suspicion) that a process has failed. This approach adjusts to system load and network conditions and allows for a more adaptive approach than timeouts with strict heartbeats. The Phi Accrual Failure Detector is used in some open source systems such as Akka Cluster and Hazelcast.
  • Ring Algorithms, where nodes are arranged in a logical ring, and each node checks only its next neighbor in the ring. If there is no response within a strict timeout, the neighbor is assumed to be down.
  • Hierarchical or Tree-based Algorithms. These are best suited for large scale systems or in systems with large amounts of cross-process connectivity. In this model, each node has an understanding of its children and parents, and will check the status down the tree while reporting progress up the tree. Status checks may be simple heartbeats, or they may be more intensive self-tests. The top level node(s) will eventually gain an understanding of which, if any, processes have failed.
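
As an illustration of the Phi Accrual idea described above, the following is a minimal sketch rather than a production implementation or the code used by any particular library. It assumes inter-arrival times are roughly normally distributed and computes phi as the negative base-10 logarithm of the probability that a heartbeat would arrive later than the time already elapsed. The class name, the `window_size` and `min_std_dev_ms` parameters, and their defaults are hypothetical.

```python
import math
from collections import deque
from statistics import mean, pstdev


class PhiAccrualDetector:
    """Hypothetical sketch of a phi accrual failure detector."""

    def __init__(self, window_size: int = 100, min_std_dev_ms: float = 10.0):
        # Sliding window of recent heartbeat inter-arrival times.
        self._intervals = deque(maxlen=window_size)
        self._last_arrival_ms = None
        # Floor on the standard deviation so very regular heartbeats
        # do not make the detector overly sensitive (an assumed safeguard).
        self._min_std_dev_ms = min_std_dev_ms

    def heartbeat(self, now_ms: float) -> None:
        # Record the gap since the previous heartbeat, if there was one.
        if self._last_arrival_ms is not None:
            self._intervals.append(now_ms - self._last_arrival_ms)
        self._last_arrival_ms = now_ms

    def phi(self, now_ms: float) -> float:
        # With no history there is nothing to base a suspicion on.
        if not self._intervals:
            return 0.0
        mu = mean(self._intervals)
        sigma = max(pstdev(self._intervals), self._min_std_dev_ms)
        elapsed = now_ms - self._last_arrival_ms
        # Probability that the next heartbeat arrives later than `elapsed`,
        # using the tail of a normal distribution.
        p_later = 0.5 * math.erfc((elapsed - mu) / (sigma * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # avoid log10(0) on underflow
        return -math.log10(p_later)


detector = PhiAccrualDetector()
for t in range(0, 1001, 100):  # heartbeats every 100 ms
    detector.heartbeat(t)

print(detector.phi(1050))  # small: the next heartbeat is not yet due
print(detector.phi(1150))  # larger: the heartbeat is overdue
```

The application compares phi against a threshold of its own choosing. A higher threshold means fewer false suspicions but slower detection, and a suitable value depends on the system.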

Each of these approaches has its own trade-offs and can provide varying degrees of liveness accuracy. An application team needs to adopt the right approach for the system.
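
One such trade-off, between detection speed and false suspicions when using a fixed heartbeat timeout, can be made concrete with a small calculation. This is a sketch only: the `false_suspicion_rate` function is a hypothetical helper, and the gap values are invented for illustration rather than measured from any real system.

```python
def false_suspicion_rate(gaps_ms, timeout_ms):
    """Fraction of heartbeat gaps from a healthy process that exceed the timeout."""
    if not gaps_ms:
        raise ValueError("at least one heartbeat gap is required")
    late = sum(1 for gap in gaps_ms if gap > timeout_ms)
    return late / len(gaps_ms)


# Hypothetical gaps between heartbeats from a process that never failed;
# the two large values stand in for pauses such as GC or network jitter.
HEALTHY_GAPS_MS = [100, 102, 98, 105, 250, 101, 99, 400, 103, 100]

for timeout_ms in (150, 300, 500):
    rate = false_suspicion_rate(HEALTHY_GAPS_MS, timeout_ms)
    print(f"timeout={timeout_ms}ms false suspicion rate={rate:.1f}")
```

With this data, a 150 ms timeout would wrongly suspect the healthy process on 20% of gaps, while a 500 ms timeout would never do so but would take more than three times as long to notice a real failure.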