Re: corosync/libqb CPU hog

Christine Caulfield <ccaulfie@xxxxxxxxxx> · Thu, 26 Feb 2015 10:00:09 +0000



On 25/02/15 07:00, Andrei Belov wrote:
> Hello,
> 
> we got a report from a client running a simple 2-node cluster on
> a couple of Ubuntu 14.04 EC2 instances. The cluster is based on corosync
> (2.3.3) and pacemaker (1.1.10). At a certain point the corosync processes
> start to eat nearly 95% CPU without any clear reason:
> 
>   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
> 25313 root      20   0  195916  76056  67976 R 94.3  2.0  60448:49 corosync
> 
>   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
> 16874 root      20   0  197976  76116  68016 R 94.2  2.0  59755:46 corosync
> 
> We tried to get the output of the "corosync-cmapctl" tool
> while this was happening, but had no luck (the command just hangs
> right after start and can only be stopped with Ctrl+C).
> 
> There was no traffic on 5405/udp on either node (checked with tcpdump).
> 
> Meanwhile, "crm status" was reporting that the cluster is alive and both
> nodes are online.
> 
> We had a chance to get cores from running corosync processes on both nodes,
> and also collected the "perf" data. Here's the backtrace and "perf report"
> output:
> 
> (gdb) bt
> #0  0x00007ffa65479870 in qb_rb_space_free () from /usr/lib/libqb.so.0
> #1  0x00007ffa65479a70 in qb_rb_chunk_alloc () from /usr/lib/libqb.so.0
> #2  0x00007ffa65483a5b in ?? () from /usr/lib/libqb.so.0
> #3  0x00007ffa65481db4 in qb_log_real_va_ () from /usr/lib/libqb.so.0
> #4  0x00007ffa65481fbc in qb_log_real_ () from /usr/lib/libqb.so.0
> #5  0x00007ffa65d40ece in joinlist_inform_clients () at cpg.c:941
> #6  cpg_sync_activate () at cpg.c:585
> #7  0x00007ffa65d390fb in sync_barrier_handler (msg=0x7ffa67f9e80e, nodeid=<optimized out>) at sync.c:236
> #8  sync_deliver_fn (nodeid=<optimized out>, msg=0x7ffa67f9e80e, msg_len=<optimized out>, endian_conversion_required=<optimized out>) at sync.c:377
> #9  0x00007ffa658ef265 in app_deliver_fn (endian_conversion_required=0, msg_len=56, msg=0x7ffa67f9e806, nodeid=15875653) at totempg.c:591
> #10 totempg_deliver_fn (nodeid=15875653, msg=0x7ffa67da28e3, msg_len=<optimized out>, endian_conversion_required=0) at totempg.c:701
> #11 0x00007ffa658e551e in messages_deliver_to_app (instance=instance@entry=0x7ffa65cd5010, end_point=<optimized out>, skip=0) at totemsrp.c:3975
> #12 0x00007ffa658e8bdc in message_handler_mcast (instance=0x7ffa65cd5010, msg=<optimized out>, msg_len=1184, endian_conversion_needed=<optimized out>)
>     at totemsrp.c:4093
> #13 0x00007ffa658e44e2 in rrp_deliver_fn (context=0x7ffa67642440, msg=0x7ffa676426f8, msg_len=1184) at totemrrp.c:1794
> #14 0x00007ffa658e0dee in net_deliver_fn (fd=<optimized out>, revents=<optimized out>, data=0x7ffa67642690) at totemudpu.c:468
> #15 0x00007ffa6547b21f in ?? () from /usr/lib/libqb.so.0
> #16 0x00007ffa6547ae00 in qb_loop_run () from /usr/lib/libqb.so.0
> #17 0x00007ffa65d33c30 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at main.c:1314
> 
> # Samples: 33K of event 'cpu-clock'
> # Event count (approx.): 8473500000
> #
> # Overhead   Command      Shared Object                    Symbol
> # ........  ........  .................  ........................
> #
>     69.63%  corosync  libqb.so.0.16.0    [.] 0x0000000000009fea  
>     22.76%  corosync  libqb.so.0.16.0    [.] qb_rb_space_free    
>      7.57%  corosync  libqb.so.0.16.0    [.] qb_rb_chunk_alloc   
>      0.04%  corosync  libqb.so.0.16.0    [.] qb_rb_space_free@plt
>      0.00%  corosync  [kernel.kallsyms]  [k] 0xffffffff8100122a  
> 
> Backtrace/perf from the second node shows nearly the same path.
> 
> The libqb package version is 0.16.0.real-1ubuntu3. Unfortunately,
> we were not able to find debug symbols for that package.
> 
> Please let us know if any other information would help to identify
> the cause(s) of the issue (configuration, logs, corefile/perf.data,
> etc.)
> 
> Any help will be greatly appreciated.
> 
> 
>


I suspect you might be running into this bug:
https://github.com/ClusterLabs/libqb/issues/139


The fix is here:
https://github.com/ClusterLabs/libqb/pull/141
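
When comparing a similar report against that issue, it can help to total the
perf overhead per shared object, to confirm that the spin is inside libqb
rather than in corosync itself. The following is only a minimal sketch: it
assumes the plain-text "perf report" layout quoted above, and the function
name overhead_by_object is hypothetical, not part of perf, corosync or libqb.

```python
import re
from collections import defaultdict

# Matches data rows of the quoted "perf report" output, for example:
#     22.76%  corosync  libqb.so.0.16.0    [.] qb_rb_space_free
# An optional leading "> " (mail quoting) is tolerated. Header lines
# starting with "#" do not match and are skipped.
ROW = re.compile(
    r"^\s*(?:>\s*)?(\d+(?:\.\d+)?)%\s+(\S+)\s+(\S+)\s+\[(.)\]\s+(\S+)"
)


def overhead_by_object(report_text):
    """Sum the Overhead column per Shared Object (hypothetical helper)."""
    totals = defaultdict(float)
    for line in report_text.splitlines():
        match = ROW.match(line)
        if match:
            percent, _command, shared_object = match.group(1, 2, 3)
            totals[shared_object] += float(percent)
    return dict(totals)


# Sample rows copied from the report quoted in this thread.
SAMPLE = """\
    69.63%  corosync  libqb.so.0.16.0    [.] 0x0000000000009fea
    22.76%  corosync  libqb.so.0.16.0    [.] qb_rb_space_free
     7.57%  corosync  libqb.so.0.16.0    [.] qb_rb_chunk_alloc
     0.04%  corosync  libqb.so.0.16.0    [.] qb_rb_space_free@plt
     0.00%  corosync  [kernel.kallsyms]  [k] 0xffffffff8100122a
"""

if __name__ == "__main__":
    totals = overhead_by_object(SAMPLE)
    # Print the shared objects from most to least expensive.
    for shared_object, percent in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{shared_object}: {percent:.2f}%")
```

For the rows quoted in this thread, essentially all samples land in
libqb.so.0.16.0, which fits the qb_rb_space_free / qb_rb_chunk_alloc
frames at the top of the backtrace.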

Chrissie