Aeron & Aeron Cluster Troubleshooting¶

Note

This is focused on diagnosis of issues on Linux based systems. Any tools and checklists here are shared without knowledge of the specific application and environment you're using, and may not be appropriate for the specific use case. If you're running Aeron in a production system, commercial support should be strongly considered.

Quick Check¶

Process is still running¶

Command	Intent
`cluster tool errors`	(Aeron Cluster only) Check for any errors raised by the Cluster or Media Driver. See Operating Aeron Cluster
`aeron stat`	Check key counters: Media Driver age, Cluster Errors, counters for ingress and egress streams. See Cluster Counters and Aeron Stat
`vmstat -SM 1`	Useful to confirm system wide status. Check CPU run queue length (`r` column), system swap usage, overall CPU usage
`mpstat -P ALL 1`	Useful to confirm system wide CPU starvation issues. Check for overall share of tasks; a single busy CPU can indicate poor configuration or an unexpectedly spinning thread
`jps`	Standard Java tool to retrieve the PID of Java processes
`jstack`	Standard Java tool. Check thread activities for anything out of the ordinary. The idle strategy should be the most commonly observed activity
`pidstat -t 1`	Useful to confirm CPU starvation, disk I/O and memory issues at the thread level. Check any thread for unexpected CPU usage, swapping and IO waits
`iostat -sxz 1`	Useful to confirm system wide IO issues. Check for high io wait times on key devices.
`free -m`	Check for high memory and swap usage
`sar -q 1`	Useful for further details on CPU starvation.
`sar -n UDP 1`	Check for high amounts of datagram errors
`GC logs`	Check for any extended pauses

Process is no longer running¶

Command	Intent
`dmesg -T \| tail`	Check for any OOM killer or related messages
`core dump logs`	If present; use `jstack` to review any unexpected thread activity in the core dump
`GC logs`	Check for any extended pauses that might exceed Aeron timeouts
`application logs`	Check for any Aeron errors

Aeron Resource Sensitives¶

Media Driver¶

The Media Driver is made up three Agrona Agents running on one to three threads, depending on your configuration:

the receiver: this interacts with log buffers (which should be on /dev/shm), and interacts with the network card
the sender: this also interacts with log buffers, and interacts with the network card
the driver conductor: this yet again interacts with log buffers, and performs DNS name resolution.

All three of them may be subject to one or multiple Idle Strategies - this is visible in jstack in the Java Media Driver.

Aeron Client¶

Beyond the Aeron objects used within the application, the Aeron client also operates an Aeron Client Conductor for interactions with the Media Driver's Driver Conductor. If this Client Conductor is unable to get CPU time due to CPU starvation, Aeron timeouts (Client Liveness) may occur.

Publications and Subscriptions may also be subject to disk I/O issues since they directly interact with log buffers.

Aeron Archive¶

Aeron Archive adds in an additional agent for the Aeron Archive Conductor, and makes use of disk I/O for the Aeron Archive data. All network I/O is performed via the Media Driver. Aeron Archive can be impacted by CPU starvation, memory and disk I/O issues.

Aeron Cluster¶

Aeron Cluster adds in additional agents for Aeron Archive, the Consensus Module and the Clustered Service, and makes use of disk I/O for the cluster Raft logs (which should never be on /dev/shm) and the mark file. All network I/O is performed via the Media Driver. This means that Aeron Cluster can be impacted by CPU starvation, memory and disk I/O issues.

Common causes of issues¶

Given the above, the primary causes of issues are:

CPU starvation, in which the system is so busy that the agents get little to no CPU time. Extended GC pauses can also cause this.
Disk I/O issues when reading/writing to the log buffers, archive files, cluster logs or mark files.
Memory issues due to either no more free memory (expect a visit from the OOM Killer or a core dump) or very slow access due to heavy paging
Network issues with high short sends, data loss or slow DNS (when resolving in the driver conductor)

Diagnostic Tools¶

Linux offers a number of tools via the sysstat package to assist with diagnosing issues such as those listed above. These will be the primary tools used below due to their relative simplicity.

vmstat¶

vmstat -SM 1 is useful to obtain a quick understanding of how busy the machine is and the state of memory and swap usage. The arg -SM uses megabytes, and improves the table alignment, and the 1 requests a refresh every second. Useful to investigate CPU starvation issues.

If you're concerned about CPU starvation, pay particular attention to the r column, which shows how many active Runnable threads the system is servicing at the time. If this exceeds the number of available CPUs, then the system is likely under strain.
If you're on a virtualized environment (such as VMWare or a cloud like AWS), check the st column, which shows what percentage of CPU time the host machine is servicing other guest virtual machines
If you're concerned about disk I/O, check the wa column, which is a measure of percentage time the CPU threads were idle blocked on disk I/O
If you're concerned about memory, check the swap si and so columns. If both columns are constantly non-zero, the system is likely under memory pressure. See also Linux PSI, and pidstat Memory view below.

➜ vmstat -SM 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 9  0      0  24064    100   3573    0    0   111    32 1274  509  3  1 96  0  0
 3  0      0  24064    100   3573    0    0     0     0 70905 143722  2  1 96  0  0
 4  0      0  24064    100   3573    0    0     0   144 73893 148005  2  1 97  0  0
 4  0      0  24064    100   3573    0    0     0    20 74762 148884  2  2 96  0  0
 8  0      0  24064    100   3573    0    0     0     0 74553 148473  2  2 96  0  0
 2  0      0  24064    100   3573    0    0     0     0 72295 146096  2  1 96  0  0
 5  0      0  24064    100   3573    0    0     0     0 74562 148984  2  1 97  0  0

Note that vmstat is not very good at keeping columns aligned - be careful when reading it.

pidstat¶

pidstat allows us to review high level and thread level stats of a process. Specific views are available which assist in investigating CPU, Memory, and I/O issues amongst others.

CPU view¶

A sample when looking at a C Media Driver, showing the CPU view at the thread level (-t) for PID 5890 (-p 5890):

➜ pidstat -t -p 5890
Linux 5.8.0-44-generic (hurricane-ubuntu)   03/11/2021  _x86_64_    (32 CPU)

08:13:39 AM   UID      TGID       TID    %usr %system  %guest   %wait    %CPU   CPU  Command
08:13:39 AM  1000      5890         -    0.13    0.16    0.00    0.00    0.29     0  aeronmd
08:13:39 AM  1000         -      5890    0.05    0.05    0.00    0.00    0.10     0  |__aeronmd
08:13:39 AM  1000         -      5891    0.04    0.06    0.00    0.01    0.09    21  |__sender
08:13:39 AM  1000         -      5892    0.04    0.06    0.00    0.01    0.09    21  |__receiver

Key values:

%usr: this is the percentage of time spent running user/application code by the thread
%system: this is the percentage of time spent running Linux kernel code by the thread
%wait: this is the pecentage of time the CPU spent waiting by the thread
%CPU: this is the total percentage utilization of the CPU by the thread
CPU: this is the CPU (processor number) the thread is running on. In the sample above, Linux has scheduled both the sender and receiver on the same CPU - which might not be ideal for all scenarios.

I/O view¶

Here's the I/O delay view (-d) in a Java Media Driver embedded in the IPC Sample at the thread level (-t) for PID 22378 (-p 22378):

➜ pidstat -t -d -p 22378
Linux 5.8.0-44-generic (hurricane-ubuntu)   03/11/2021  _x86_64_    (32 CPU)

09:35:39 AM   UID      TGID       TID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
09:35:39 AM  1000     22378         -      0.00      0.01      0.00       0  java
...java compiler, GC etc. threads ommitted
09:35:39 AM  1000         -     22406      0.00      0.00      0.00       0  |__sender
09:35:39 AM  1000         -     22407      0.00      0.00      0.00       0  |__receiver
09:35:39 AM  1000         -     22408      0.00      0.00      0.00       0  |__aeron-client-co
09:35:39 AM  1000         -     22409      0.00      0.00      0.00       0  |__send-agent
09:35:39 AM  1000         -     22410      0.00      0.00      0.00       0  |__receive-agent

Key values:

kB_rd/s: Kilobytes read per second for the thread
kB_wd/s: Kilobytes issued for write per second for the thread
kB_ccwr/s: Kilobytes canceled for write per second for the thread
iodelay: The time the process was blocked on disk I/O - including swapping - for the thread. This is measured in clock ticks - high values here should be investigated further.

Memory view¶

Here's the Memory view (-r) in a Java Media Driver embedded in the Cluster Sample at the thread level (-t) for PID 37663 (-p 37663):

➜  pidstat -t -r -p 37663
Linux 5.8.0-44-generic (hurricane-ubuntu)   03/11/2021  _x86_64_    (32 CPU)

10:51:43 AM   UID      TGID       TID  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
10:51:43 AM  1000     37663         -      2.13      0.00 12596716  100760   0.31  java
...java compiler, GC etc. threads ommitted
10:51:43 AM  1000         -     37691      0.07      0.00 12596716  100760   0.31  |__[sender,receive
10:51:43 AM  1000         -     37692      0.01      0.00 12596716  100760   0.31  |__archive-conduct
10:51:43 AM  1000         -     37693      0.03      0.00 12596716  100760   0.31  |__consensus-modul
10:51:43 AM  1000         -     37694      0.00      0.00 12596716  100760   0.31  |__aeron-client-co
10:51:43 AM  1000         -     37695      0.01      0.00 12596716  100760   0.31  |__clustered-servi

Note that in this sample, the Media Driver sender, receiver and driver conductor are running on a shared thread. The clustered service holds the application code.

Key values:

minflt/s: Minor faults per second for the thread (faults which have not required loading a memory page from disk)
majflt/s: Major faults per second for the thread (faults which required loading a memory page from disk). High values here should be investigated further (does the machine have enough memory, etc.)
VSZ: this is the virtual memory size of the entire task in kb
RSS: this is the resident set size of the entire task in kb

Linux PSI¶

If the system is suspected to be under memory pressure (e.g. vmstat shows constant values under so and si), then this can be confirmed by reviewing Linux pressure stall information. This is reviewed via cat /proc/pressure/memory, and provides 10-second, 1 minute and 10 minute averages for the percentage of time processes were stalled. The some value shows the time that some threads or tasks were affected and the full value shows when all threads or tasks were affected.

➜ cat /proc/pressure/memory  
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

If waits are high (especially for full), then memory issues need to be investigated.

jstack¶

This is a standard Java tool, and can be useful to understand exactly what each thread is doing. In the sample below, we can see the dedicated media driver threads embedded in the IPC Sample. Note that standard Java threads have been omitted in the sample output.

Threads of particular interest:

"driver-conductor": this is the Media Driver driver conductor thread, and is currently parked due to the SleepingIdleStrategy
"sender": this is the Media Driver sender thread, and is currently parked due to the SleepingIdleStrategy
"receiver": this is the Media Driver receiver thread, and is currently parked due to the SleepingIdleStrategy
"aeron-client-conductor": this is the Aeron Client conductor thread, and is currently parked due to the SleepingMillisIdleStrategy
"archive-conductor": found with Aeron Archive and Aeron Cluster (not shown below).
"consensus-module": found with Aeron Cluster (not shown below).
"clustered-service": found with Aeron Cluster (not shown below).

This tool can be useful to diagnose some issues. For example, if you see io.aeron.driver.buffer.MappedRawLog in the stack often, you may be suffering from slow disk or memory I/O. If you see a child of io.aeron.driver.NameResolver, you may be suffering from slow name resolution.

If you have a Java core dump, you can use jstack to produce the same view as below.

As an alternate to running jstack on demand, you could enable JFR (Java Flight Recorder) recording, and review the thread dumps within a JFR recording file.

➜ jstack 9030
2021-03-11 08:23:04
Full thread dump OpenJDK 64-Bit Server VM (15.0.1+8 mixed mode, sharing):

Threads class SMR info:
_java_thread_list=0x00007f6440001790, length=17, elements={
0x00007f6498028a30, 0x00007f6498178660, 0x00007f6498179da0, 0x00007f649817f5c0,
0x00007f6498180cd0, 0x00007f64981829b0, 0x00007f64981841c0, 0x00007f64981858c0,
0x00007f64981bae40, 0x00007f64981be900, 0x00007f649842a0a0, 0x00007f649842b000,
0x00007f649842c270, 0x00007f64984378d0, 0x00007f6498443930, 0x00007f6498440e30,
0x00007f6440000d20
}

"main" #1 prio=5 os_prio=0 cpu=339.72ms elapsed=66.34s tid=0x00007f6498028a30 nid=0x2348 waiting on condition  [0x00007f649e1cd000]
   java.lang.Thread.State: WAITING (parking)
    at jdk.internal.misc.Unsafe.park(java.base@15.0.1/Native Method)
    - parking to wait for  <0x00000006297c4318> (a java.util.concurrent.CountDownLatch$Sync)
    at java.util.concurrent.locks.LockSupport.park(java.base@15.0.1/LockSupport.java:211)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@15.0.1/AbstractQueuedSynchronizer.java:714)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@15.0.1/AbstractQueuedSynchronizer.java:1046)
    at java.util.concurrent.CountDownLatch.await(java.base@15.0.1/CountDownLatch.java:232)
    at org.agrona.concurrent.ShutdownSignalBarrier.await(ShutdownSignalBarrier.java:98)
    at com.aeroncookbook.ipc.agents.StartHere.main(StartHere.java:76)

"driver-conductor" #22 prio=5 os_prio=0 cpu=18051.12ms elapsed=66.01s tid=0x00007f649842a0a0 nid=0x235e runnable  [0x00007f645c923000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at jdk.internal.misc.Unsafe.park(java.base@15.0.1/Native Method)
    at java.util.concurrent.locks.LockSupport.parkNanos(java.base@15.0.1/LockSupport.java:376)
    at org.agrona.concurrent.SleepingIdleStrategy.idle(SleepingIdleStrategy.java:67)
    at org.agrona.concurrent.AgentRunner.doDutyCycle(AgentRunner.java:292)
    at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:164)
    at java.lang.Thread.run(java.base@15.0.1/Thread.java:832)

"sender" #23 prio=5 os_prio=0 cpu=17731.61ms elapsed=66.01s tid=0x00007f649842b000 nid=0x235f runnable  [0x00007f645c822000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at jdk.internal.misc.Unsafe.park(java.base@15.0.1/Native Method)
    at java.util.concurrent.locks.LockSupport.parkNanos(java.base@15.0.1/LockSupport.java:376)
    at org.agrona.concurrent.SleepingIdleStrategy.idle(SleepingIdleStrategy.java:67)
    at org.agrona.concurrent.AgentRunner.doDutyCycle(AgentRunner.java:292)
    at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:164)
    at java.lang.Thread.run(java.base@15.0.1/Thread.java:832)

"receiver" #24 prio=5 os_prio=0 cpu=17659.60ms elapsed=66.01s tid=0x00007f649842c270 nid=0x2360 runnable  [0x00007f645c721000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at jdk.internal.misc.Unsafe.park(java.base@15.0.1/Native Method)
    at java.util.concurrent.locks.LockSupport.parkNanos(java.base@15.0.1/LockSupport.java:376)
    at org.agrona.concurrent.SleepingIdleStrategy.idle(SleepingIdleStrategy.java:67)
    at org.agrona.concurrent.AgentRunner.doDutyCycle(AgentRunner.java:292)
    at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:164)
    at java.lang.Thread.run(java.base@15.0.1/Thread.java:832)

"aeron-client-conductor" #25 prio=5 os_prio=0 cpu=170.67ms elapsed=66.00s tid=0x00007f64984378d0 nid=0x2361 waiting on condition  [0x00007f645c620000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(java.base@15.0.1/Native Method)
    at org.agrona.concurrent.SleepingMillisIdleStrategy.idle(SleepingMillisIdleStrategy.java:67)
    at org.agrona.concurrent.AgentRunner.doDutyCycle(AgentRunner.java:292)
    at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:164)
    at java.lang.Thread.run(java.base@15.0.1/Thread.java:832)

mpstat¶

mpstat can provide whole system level CPU diagnostics, including NUMA node information. This can be useful to diagnose higher level problems such as a CPU hotspot. Useful to investigate CPU starvation issues.

➜  ~ mpstat -P ALL
Linux 5.8.0-44-generic (hurricane-ubuntu)   03/11/2021  _x86_64_    (32 CPU)

11:03:38 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
11:03:38 AM  all    1.06    0.00    0.35    0.00    0.00    0.00    0.00    0.00    0.00   98.58
11:03:38 AM    0    1.08    0.01    0.32    0.01    0.00    0.03    0.00    0.00    0.00   98.56
11:03:38 AM    1    0.96    0.00    0.34    0.00    0.00    0.01    0.00    0.00    0.00   98.69
11:03:38 AM    2    0.97    0.01    0.34    0.00    0.00    0.00    0.00    0.00    0.00   98.67
11:03:38 AM    3    1.15    0.00    0.39    0.02    0.00    0.02    0.00    0.00    0.00   98.42
11:03:38 AM    4    1.02    0.00    0.39    0.00    0.00    0.00    0.00    0.00    0.00   98.59
11:03:38 AM    5    1.06    0.00    0.38    0.00    0.00    0.00    0.00    0.00    0.00   98.56
11:03:38 AM    6    1.26    0.00    0.37    0.00    0.00    0.00    0.00    0.00    0.00   98.37
...

Key values:

CPU: The CPU the stats are for. all is the averages view for all processors. This matches up to the processor number as seen in pidstat CPU view above
%iowait: this is the percentage of time that the CPU was idle while the system was awaiting a disk I/O request
%irq: this is the percentage of time spent by the CPU on servicing hardware interrupts
%soft: this is the percentage of time spent by the CPU on servicing software interrupts
%steal: this is the percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
%idle: this is the percentage of time that the CPU was idle and the system did not have an outstanding disk I/O request

iostat¶

iostat shows information on the CPU and Device I/O at the system level. The -sxz view is especially useful to investigate disk I/O issues.

➜  iostat -sxz
Linux 5.8.0-44-generic (hurricane-ubuntu)   03/11/2021  _x86_64_    (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.04    0.00    0.36    0.00    0.00   98.59

Device             tps      kB/s    rqm/s   await  areq-sz  aqu-sz  %util
loop0             0.01      0.10     0.00    0.25    17.48    0.00   0.00
...
sda              10.34    392.05     5.54    0.49    37.91    0.01   0.60
scd0              0.00      0.00     0.00    0.36     0.23    0.00   0.00

Key values:

tps: indicates the number of transfers per second that were issued to the device. Multiple logical I/O requests can be combined into a single I/O request to the device. This makes no indication of size.
kB/s: The number of kilobytes read from, written to or discarded for the device per second.
rqm/s: the number of read+write I/O requests per second that were queued to the device
await: the average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
areq-sz: the average size (in kilobytes) of the I/O requests that were issued to the device.
aqu-sz: the average queue length of the requests that were issued to the device
%util: percentage of time during which I/O requests were currently in progress on the device when checked by the kernel (i.e. utilization for the device).

sar¶

sar exposes a whole host of Linux kernel counters, many of which are also exposed with the tools above.

Run Queue & Blocked Threads¶

This provides a system level overview of CPU run queue lengths and blocked threads. Useful to investigate CPU starvation issues.

➜  ~ sar -q 1
Linux 5.8.0-44-generic (hurricane-ubuntu)   03/11/2021  _x86_64_    (32 CPU)

03:16:33 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
03:16:34 PM         0       806      1.26      0.35      0.12         0
03:16:35 PM         0       806      1.26      0.35      0.12         0
03:16:36 PM         1       805      1.15      0.35      0.12         0
...

Key values:

runq-sz: run queue length (number of tasks waiting for run time)
plist-sz: the number of tasks in the task list.
blocked: the number of tasks currently blocked, waiting for I/O to complete.

A process level view is available in /proc/PID/schedstat - in particular the second field. This is the total time, in nanoseconds, spent on the run queue. In the sample below, there was a delay of 58ms for the PID 2074.

➜ cat /proc/2074/schedstat
12300734265 58729796 71519

UDP View¶

This provides a system level overview of UDP statistics. Useful to investigate network I/O issues.

➜  ~ sar -n UDP 1  
Linux 5.8.0-44-generic (hurricane-ubuntu)   03/11/2021  _x86_64_    (32 CPU)

11:25:53 AM    idgm/s    odgm/s  noport/s idgmerr/s
11:25:54 AM      0.00      0.00      0.00      0.00
11:25:55 AM      0.00      0.00      0.00      0.00
11:25:56 AM      0.00      0.00      0.00      0.00
11:25:57 AM      0.00      0.00      0.00      0.00

Key values:

idgm/s: the total number of UDP datagrams delivered per second to UDP users
odgm/s: the total number of UDP datagrams sent per second from this entity
noport/s: the total number of received UDP datagrams per second for which there was no application at the destination port
idgmerr/s: the number of received UDP datagrams per second that could not be delivered for reasons other than the lack of an application at the destination port