On 25/02/15 07:00, Andrei Belov wrote:
> Hello,
>
> we got a report from a client running simple 2-node cluster on
> a couple of Ubuntu 14.04 EC2 instances. Cluster is based on corosync
> (2.3.3) and pacemaker (1.1.10). At a certain point corosync processes
> start to eat nearly 95% CPU without any clear reason:
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> 25313 root      20   0  195916  76056  67976 R  94.3  2.0  60448:49 corosync
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> 16874 root      20   0  197976  76116  68016 R  94.2  2.0  59755:46 corosync
>
> We were trying to get the output of the "corosync-cmapctl" tool
> while it has been happening, but had no luck (command just hangs
> right after start, able to stop by Ctrl+C).
>
> There was no traffic on 5405/udp on both nodes (checked with tcpdump).
>
> Meanwhile, "crm status" was reporting that cluster is alive and both nodes
> are online.
>
> We had a chance to get cores from running corosync processes on both nodes,
> and also collected the "perf" data. Here's the backtrace and "perf report"
> output:
>
> (gdb) bt
> #0  0x00007ffa65479870 in qb_rb_space_free () from /usr/lib/libqb.so.0
> #1  0x00007ffa65479a70 in qb_rb_chunk_alloc () from /usr/lib/libqb.so.0
> #2  0x00007ffa65483a5b in ?? () from /usr/lib/libqb.so.0
> #3  0x00007ffa65481db4 in qb_log_real_va_ () from /usr/lib/libqb.so.0
> #4  0x00007ffa65481fbc in qb_log_real_ () from /usr/lib/libqb.so.0
> #5  0x00007ffa65d40ece in joinlist_inform_clients () at cpg.c:941
> #6  cpg_sync_activate () at cpg.c:585
> #7  0x00007ffa65d390fb in sync_barrier_handler (msg=0x7ffa67f9e80e, nodeid=<optimized out>) at sync.c:236
> #8  sync_deliver_fn (nodeid=<optimized out>, msg=0x7ffa67f9e80e, msg_len=<optimized out>, endian_conversion_required=<optimized out>) at sync.c:377
> #9  0x00007ffa658ef265 in app_deliver_fn (endian_conversion_required=0, msg_len=56, msg=0x7ffa67f9e806, nodeid=15875653) at totempg.c:591
> #10 totempg_deliver_fn (nodeid=15875653, msg=0x7ffa67da28e3, msg_len=<optimized out>, endian_conversion_required=0) at totempg.c:701
> #11 0x00007ffa658e551e in messages_deliver_to_app (instance=instance@entry=0x7ffa65cd5010, end_point=<optimized out>, skip=0) at totemsrp.c:3975
> #12 0x00007ffa658e8bdc in message_handler_mcast (instance=0x7ffa65cd5010, msg=<optimized out>, msg_len=1184, endian_conversion_needed=<optimized out>)
>     at totemsrp.c:4093
> #13 0x00007ffa658e44e2 in rrp_deliver_fn (context=0x7ffa67642440, msg=0x7ffa676426f8, msg_len=1184) at totemrrp.c:1794
> #14 0x00007ffa658e0dee in net_deliver_fn (fd=<optimized out>, revents=<optimized out>, data=0x7ffa67642690) at totemudpu.c:468
> #15 0x00007ffa6547b21f in ?? () from /usr/lib/libqb.so.0
> #16 0x00007ffa6547ae00 in qb_loop_run () from /usr/lib/libqb.so.0
> #17 0x00007ffa65d33c30 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at main.c:1314
>
> # Samples: 33K of event 'cpu-clock'
> # Event count (approx.): 8473500000
> #
> # Overhead  Command   Shared Object      Symbol
> # ........  ........  .................  ........................
> #
>     69.63%  corosync  libqb.so.0.16.0    [.] 0x0000000000009fea
>     22.76%  corosync  libqb.so.0.16.0    [.] qb_rb_space_free
>      7.57%  corosync  libqb.so.0.16.0    [.] qb_rb_chunk_alloc
>      0.04%  corosync  libqb.so.0.16.0    [.] qb_rb_space_free@plt
>      0.00%  corosync  [kernel.kallsyms]  [k] 0xffffffff8100122a
>
> Backtrace/perf from the second node shows nearly the same path.
>
> The libqb package version is 0.16.0.real-1ubuntu3. Unfortunately,
> we were not able to find debug symbols for that package.
>
> Please let us know if any other information would help to identify
> the reason(s) of the issue (configuration, logs, corefile/perf.data,
> etc.)
>
> Any help will be greatly appreciated.
>

I suspect you might be running into this bug:
https://github.com/ClusterLabs/libqb/issues/139

The fix is here:
https://github.com/ClusterLabs/libqb/pull/141

Chrissie
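
As a side note for anyone reading the "perf report" output quoted above: the rows can be tallied per shared object to confirm that essentially all sampled CPU time falls inside libqb. The following is only a minimal sketch in Python. The sample rows are copied from the report, while the helper name `overhead_by_object` and the regular expression are illustrative assumptions, not part of perf or libqb.

```python
import re
from collections import defaultdict
from typing import Dict

# Data rows copied from the "perf report" output in the original message.
PERF_ROWS = """\
# Overhead  Command   Shared Object      Symbol
69.63%  corosync  libqb.so.0.16.0    [.] 0x0000000000009fea
22.76%  corosync  libqb.so.0.16.0    [.] qb_rb_space_free
 7.57%  corosync  libqb.so.0.16.0    [.] qb_rb_chunk_alloc
 0.04%  corosync  libqb.so.0.16.0    [.] qb_rb_space_free@plt
 0.00%  corosync  [kernel.kallsyms]  [k] 0xffffffff8100122a
"""

# Hypothetical pattern for one data row:
# overhead%, command, shared object, [privilege marker], symbol.
ROW = re.compile(r"^\s*(\d+\.\d+)%\s+(\S+)\s+(\S+)\s+\[(.)\]\s+(\S+)\s*$")


def overhead_by_object(report: str) -> Dict[str, float]:
    """Sum the overhead percentages of a perf report per shared object."""
    totals: Dict[str, float] = defaultdict(float)
    for line in report.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header, comment, or blank line
        percent, _command, shared_object, _marker, _symbol = match.groups()
        totals[shared_object] += float(percent)
    # Round to two decimals to hide floating-point summation noise.
    return {obj: round(total, 2) for obj, total in totals.items()}


if __name__ == "__main__":
    for obj, total in overhead_by_object(PERF_ROWS).items():
        print(f"{total:6.2f}%  {obj}")
```

With the rows from the report, the libqb entries add up to the whole sample, which matches the backtrace showing the process inside qb_rb_chunk_alloc and qb_rb_space_free.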