Hi,
We’ve had several corosync coredumps recently on a couple of the boxes in our 6-node cluster. Here’s the backtrace:
#0 0x00a807a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x00154825 in raise () from /lib/tls/libc.so.6
#2 0x00156289 in abort () from /lib/tls/libc.so.6
#3 0x0014dda1 in __assert_fail () from /lib/tls/libc.so.6
#4 0x00d65268 in totemsrp_callback_token_destroy () from /usr/lib/libtotem_pg.so.4
#5 0x00d6a3ef in main_deliver_fn () from /usr/lib/libtotem_pg.so.4
#6 0x00d5cea8 in totemudpu_member_remove () from /usr/lib/libtotem_pg.so.4
#7 0x00d5e582 in rrp_deliver_fn () from /usr/lib/libtotem_pg.so.4
#8 0x00000000 in ?? ()
From looking at the source code, the assertion appears to fail while totemsrp_callback_token_destroy() is deleting an item from a linked list.
There are several other things going on as well.
The logs show a continuous stream of messages reporting that a processor has joined or left the membership. There are tens of thousands of these lines:
Mar 04 10:35:14 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 04 10:35:14 corosync [CPG ] chosen downlist: sender r(0) ip(10.22.163.116) ; members(old:5 left:0)
Mar 04 10:35:14 corosync [MAIN ] Completed service synchronization, ready to provide service.
Mar 04 10:35:14 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 2208116: memb=5, new=0, lost=0
Mar 04 10:35:14 corosync [pcmk ] info: pcmk_peer_update: memb: intersp2-admin1 1956845066
Mar 04 10:35:14 corosync [pcmk ] info: pcmk_peer_update: memb: intersp2-engine1 1990399498
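To put a number on how fast the membership is flapping, something like the following could count the TOTEM membership messages per minute. This is only a sketch: the regular expression is modelled on the log lines quoted above, and the function name membership_changes_per_minute is made up for illustration.

```python
import re
from collections import Counter

# Modelled on the "[TOTEM ]" line quoted above; the timestamp layout
# ("Mar 04 10:35:14") is assumed to be the same throughout the log.
MEMBERSHIP_RE = re.compile(
    r"^(\w{3} \d{2} \d{2}:\d{2}):\d{2} corosync \[TOTEM\s*\] "
    r"A processor joined or left the membership"
)


def membership_changes_per_minute(lines):
    """Return a Counter mapping 'Mon DD HH:MM' to the number of membership changes."""
    counts = Counter()
    for line in lines:
        match = MEMBERSHIP_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

It takes any iterable of lines, so an open log file can be passed in directly. The location of the corosync log depends on the logging configuration, so the path has to be filled in per box.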
Each coredump shows up in the logs as the Pacemaker daemons losing their connection, followed by corosync restarting:
Mar 04 10:35:14 intersp2-db2 crmd: [24013]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
Mar 04 10:35:14 intersp2-db2 attrd: [24011]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
Mar 04 10:35:14 intersp2-db2 cib: [24009]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
Mar 04 10:35:14 intersp2-db2 crmd: [24013]: ERROR: ais_dispatch: AIS connection failed
Mar 04 10:35:14 intersp2-db2 attrd: [24011]: ERROR: ais_dispatch: AIS connection failed
Mar 04 10:35:14 intersp2-db2 cib: [24009]: ERROR: ais_dispatch: AIS connection failed
Mar 04 10:35:14 intersp2-db2 crmd: [24013]: ERROR: crm_ais_destroy: AIS connection terminated
Mar 04 10:35:14 intersp2-db2 attrd: [24011]: CRIT: attrd_ais_destroy: Lost connection to OpenAIS service!
Mar 04 10:35:14 intersp2-db2 cib: [24009]: ERROR: cib_ais_destroy: AIS connection terminated
Mar 04 10:35:14 intersp2-db2 attrd: [24011]: info: main: Exiting...
Mar 04 10:35:14 intersp2-db2 attrd: [24011]: ERROR: attrd_cib_connection_destroy: Connection to the CIB terminated...
Mar 04 10:35:14 intersp2-db2 stonithd: [24008]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Success (0)
Mar 04 10:35:14 intersp2-db2 stonithd: [24008]: ERROR: ais_dispatch: AIS connection failed
Mar 04 10:35:14 intersp2-db2 stonithd: [24008]: ERROR: AIS connection terminated
Mar 04 10:35:21 corosync [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
On one of the boxes that had coredumped several times, /dev/shm was completely full of files named dispatch*, response*, request*, and control*.
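For comparing boxes, a small script along these lines could tally the leftover files by name prefix. It is a sketch under assumptions: the four prefixes are simply the ones seen above, tally_shm_files is a made-up name, and the byte counts are apparent file sizes (st_size), which may differ from the space actually allocated on tmpfs.

```python
import os
from collections import defaultdict

# Prefixes of the leftover files observed in /dev/shm on the affected box.
PREFIXES = ("dispatch", "response", "request", "control")


def tally_shm_files(directory="/dev/shm"):
    """Return {prefix: (file_count, total_bytes)} for regular files matching PREFIXES."""
    totals = defaultdict(lambda: [0, 0])
    for entry in os.scandir(directory):
        # Skip directories, symlinks, and anything else that is not a regular file.
        if not entry.is_file(follow_symlinks=False):
            continue
        for prefix in PREFIXES:
            if entry.name.startswith(prefix):
                totals[prefix][0] += 1
                totals[prefix][1] += entry.stat(follow_symlinks=False).st_size
                break
    return {prefix: tuple(pair) for prefix, pair in totals.items()}
```

Calling tally_shm_files() with no argument scans /dev/shm. Running it before and after a coredump would show whether each crash leaves another set of files behind.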
I also found that one box that should have been in the cluster was unable to join it. I'm not sure why.
All 5 of the boxes that were in the cluster coredumped at least once, each with the same stack trace as above.
Any ideas as to what is going on here?
-gm