Hello, we got a report from a client running a simple 2-node cluster on a couple of Ubuntu 14.04 EC2 instances. The cluster is based on corosync (2.3.3) and pacemaker (1.1.10). At a certain point the corosync processes start to consume nearly 95% CPU without any clear reason:

  PID USER  PR NI  VIRT   RES   SHR   S %CPU %MEM TIME+    COMMAND
25313 root  20  0  195916 76056 67976 R 94.3  2.0 60448:49 corosync

  PID USER  PR NI  VIRT   RES   SHR   S %CPU %MEM TIME+    COMMAND
16874 root  20  0  197976 76116 68016 R 94.2  2.0 59755:46 corosync

We tried to get the output of the "corosync-cmapctl" tool while this was happening, but had no luck: the command hangs right after starting (it can be interrupted with Ctrl+C). There was no traffic on 5405/udp on either node (checked with tcpdump). Meanwhile, "crm status" reported that the cluster was alive and both nodes were online.

We managed to get cores from the running corosync processes on both nodes, and also collected "perf" data. Here are the backtrace and the "perf report" output:

(gdb) bt
#0  0x00007ffa65479870 in qb_rb_space_free () from /usr/lib/libqb.so.0
#1  0x00007ffa65479a70 in qb_rb_chunk_alloc () from /usr/lib/libqb.so.0
#2  0x00007ffa65483a5b in ?? () from /usr/lib/libqb.so.0
#3  0x00007ffa65481db4 in qb_log_real_va_ () from /usr/lib/libqb.so.0
#4  0x00007ffa65481fbc in qb_log_real_ () from /usr/lib/libqb.so.0
#5  0x00007ffa65d40ece in joinlist_inform_clients () at cpg.c:941
#6  cpg_sync_activate () at cpg.c:585
#7  0x00007ffa65d390fb in sync_barrier_handler (msg=0x7ffa67f9e80e, nodeid=<optimized out>) at sync.c:236
#8  sync_deliver_fn (nodeid=<optimized out>, msg=0x7ffa67f9e80e, msg_len=<optimized out>, endian_conversion_required=<optimized out>) at sync.c:377
#9  0x00007ffa658ef265 in app_deliver_fn (endian_conversion_required=0, msg_len=56, msg=0x7ffa67f9e806, nodeid=15875653) at totempg.c:591
#10 totempg_deliver_fn (nodeid=15875653, msg=0x7ffa67da28e3, msg_len=<optimized out>, endian_conversion_required=0) at totempg.c:701
#11 0x00007ffa658e551e in messages_deliver_to_app (instance=instance@entry=0x7ffa65cd5010, end_point=<optimized out>, skip=0) at totemsrp.c:3975
#12 0x00007ffa658e8bdc in message_handler_mcast (instance=0x7ffa65cd5010, msg=<optimized out>, msg_len=1184, endian_conversion_needed=<optimized out>) at totemsrp.c:4093
#13 0x00007ffa658e44e2 in rrp_deliver_fn (context=0x7ffa67642440, msg=0x7ffa676426f8, msg_len=1184) at totemrrp.c:1794
#14 0x00007ffa658e0dee in net_deliver_fn (fd=<optimized out>, revents=<optimized out>, data=0x7ffa67642690) at totemudpu.c:468
#15 0x00007ffa6547b21f in ?? () from /usr/lib/libqb.so.0
#16 0x00007ffa6547ae00 in qb_loop_run () from /usr/lib/libqb.so.0
#17 0x00007ffa65d33c30 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at main.c:1314

# Samples: 33K of event 'cpu-clock'
# Event count (approx.): 8473500000
#
# Overhead  Command   Shared Object      Symbol
# ........  ........  .................  ........................
#
    69.63%  corosync  libqb.so.0.16.0    [.] 0x0000000000009fea
    22.76%  corosync  libqb.so.0.16.0    [.] qb_rb_space_free
     7.57%  corosync  libqb.so.0.16.0    [.] qb_rb_chunk_alloc
     0.04%  corosync  libqb.so.0.16.0    [.] qb_rb_space_free@plt
     0.00%  corosync  [kernel.kallsyms]  [k] 0xffffffff8100122a

The backtrace and perf data from the second node show nearly the same path. The libqb package version is 0.16.0.real-1ubuntu3. Unfortunately, we were not able to find debug symbols for that package.

Please let us know if any other information would help to identify the cause of the issue (configuration, logs, corefile/perf.data, etc.).
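For reference, the cores and perf data were captured roughly as follows (PIDs taken from the top output above; the exact options may have differed slightly):

    # dump a core from the running corosync without stopping it (gcore ships with gdb)
    gcore -o corosync-node1.core 25313

    # sample the busy process for ~30 seconds, then summarize
    perf record -p 25313 -- sleep 30
    perf report --stdio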
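As for the debug symbols: assuming the binary package is libqb0, the usual Ubuntu ddebs route would be something like the sketch below, but it did not turn up symbols for this particular build (perhaps because of the "0.16.0.real" versioning):

    # enable the Ubuntu debug symbol archive for trusty
    # (importing the ddebs archive signing key may also be needed)
    echo "deb http://ddebs.ubuntu.com trusty main restricted universe multiverse" | \
        sudo tee /etc/apt/sources.list.d/ddebs.list
    sudo apt-get update
    sudo apt-get install libqb0-dbgsym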
Any help will be greatly appreciated.

PS: please let me know if this is worth cross-posting to the libqb and/or pacemaker mailing lists as well.