corosync/libqb CPU hog

Andrei Belov <defanator@xxxxxxxxx> · Wed, 25 Feb 2015 10:00:03 +0300

Hello,

We got a report from a client running a simple 2-node cluster on
a pair of Ubuntu 14.04 EC2 instances. The cluster is based on corosync
(2.3.3) and pacemaker (1.1.10). At a certain point the corosync processes
start to consume nearly 95% CPU for no apparent reason:

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
25313 root      20   0  195916  76056  67976 R 94.3  2.0  60448:49 corosync

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
16874 root      20   0  197976  76116  68016 R 94.2  2.0  59755:46 corosync

We tried to get the output of the "corosync-cmapctl" tool
while this was happening, but had no luck: the command hangs
right after start and can be interrupted with Ctrl+C.

There was no traffic on 5405/udp on either node (checked with tcpdump).

Meanwhile, "crm status" was reporting that the cluster is alive and both
nodes are online.

We were able to get core dumps from the running corosync processes on both
nodes, and also collected "perf" data. Here are the backtrace and the
"perf report" output:

(gdb) bt
#0  0x00007ffa65479870 in qb_rb_space_free () from /usr/lib/libqb.so.0
#1  0x00007ffa65479a70 in qb_rb_chunk_alloc () from /usr/lib/libqb.so.0
#2  0x00007ffa65483a5b in ?? () from /usr/lib/libqb.so.0
#3  0x00007ffa65481db4 in qb_log_real_va_ () from /usr/lib/libqb.so.0
#4  0x00007ffa65481fbc in qb_log_real_ () from /usr/lib/libqb.so.0
#5  0x00007ffa65d40ece in joinlist_inform_clients () at cpg.c:941
#6  cpg_sync_activate () at cpg.c:585
#7  0x00007ffa65d390fb in sync_barrier_handler (msg=0x7ffa67f9e80e, nodeid=<optimized out>) at sync.c:236
#8  sync_deliver_fn (nodeid=<optimized out>, msg=0x7ffa67f9e80e, msg_len=<optimized out>, endian_conversion_required=<optimized out>) at sync.c:377
#9  0x00007ffa658ef265 in app_deliver_fn (endian_conversion_required=0, msg_len=56, msg=0x7ffa67f9e806, nodeid=15875653) at totempg.c:591
#10 totempg_deliver_fn (nodeid=15875653, msg=0x7ffa67da28e3, msg_len=<optimized out>, endian_conversion_required=0) at totempg.c:701
#11 0x00007ffa658e551e in messages_deliver_to_app (instance=instance@entry=0x7ffa65cd5010, end_point=<optimized out>, skip=0) at totemsrp.c:3975
#12 0x00007ffa658e8bdc in message_handler_mcast (instance=0x7ffa65cd5010, msg=<optimized out>, msg_len=1184, endian_conversion_needed=<optimized out>)
    at totemsrp.c:4093
#13 0x00007ffa658e44e2 in rrp_deliver_fn (context=0x7ffa67642440, msg=0x7ffa676426f8, msg_len=1184) at totemrrp.c:1794
#14 0x00007ffa658e0dee in net_deliver_fn (fd=<optimized out>, revents=<optimized out>, data=0x7ffa67642690) at totemudpu.c:468
#15 0x00007ffa6547b21f in ?? () from /usr/lib/libqb.so.0
#16 0x00007ffa6547ae00 in qb_loop_run () from /usr/lib/libqb.so.0
#17 0x00007ffa65d33c30 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at main.c:1314

# Samples: 33K of event 'cpu-clock'
# Event count (approx.): 8473500000
#
# Overhead  Command   Shared Object      Symbol
# ........  ........  .................  ........................
#
    69.63%  corosync  libqb.so.0.16.0    [.] 0x0000000000009fea
    22.76%  corosync  libqb.so.0.16.0    [.] qb_rb_space_free
     7.57%  corosync  libqb.so.0.16.0    [.] qb_rb_chunk_alloc
     0.04%  corosync  libqb.so.0.16.0    [.] qb_rb_space_free@plt
     0.00%  corosync  [kernel.kallsyms]  [k] 0xffffffff8100122a

The backtrace and perf output from the second node show nearly the same path.
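
As a reading aid for the "perf report" rows, here is a minimal Python
sketch that sums the overhead column per shared object. The function name
overhead_by_object and the regular expression are illustrative assumptions,
not part of perf; the rows are copied from the report in this message.

```python
import re
from collections import defaultdict

# Rows copied from the "perf report" output in this message.
PERF_REPORT = """\
    69.63%  corosync  libqb.so.0.16.0    [.] 0x0000000000009fea
    22.76%  corosync  libqb.so.0.16.0    [.] qb_rb_space_free
     7.57%  corosync  libqb.so.0.16.0    [.] qb_rb_chunk_alloc
     0.04%  corosync  libqb.so.0.16.0    [.] qb_rb_space_free@plt
     0.00%  corosync  [kernel.kallsyms]  [k] 0xffffffff8100122a
"""

# Columns: overhead%, command, shared object, [symbol type], symbol.
# Header lines starting with '#' do not match and are skipped.
ROW = re.compile(r"^\s*(\d+\.\d+)%\s+(\S+)\s+(\S+)\s+\[(.)\]\s+(\S+)\s*$")


def overhead_by_object(report):
    """Return {shared_object: summed overhead in percent, rounded to 2 places}."""
    totals = defaultdict(float)
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            totals[match.group(3)] += float(match.group(1))
    return {obj: round(pct, 2) for obj, pct in totals.items()}


if __name__ == "__main__":
    for obj, pct in overhead_by_object(PERF_REPORT).items():
        print(f"{obj}: {pct:.2f}%")
```

With these rows, essentially all sampled time is attributed to
libqb.so.0.16.0, which matches the top frames of the backtrace.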

The libqb package version is 0.16.0.real-1ubuntu3. Unfortunately,
we were not able to find debug symbols for that package.

Please let us know if any other information would help to identify
the cause(s) of the issue (configuration, logs, corefile/perf.data,
etc.).

Any help will be greatly appreciated.

PS: please let me know if this is worth cross-posting to the
libqb and/or pacemaker mailing lists as well.
