corosync 1.4.1-1 coredumps and /dev/shm full

"Grant Martin (granmart)" <granmart@xxxxxxxxx> · Tue, 13 Mar 2012 16:45:01 -0700

Hi,
We’ve gotten several coredumps recently on a couple of the boxes in our 6-node cluster. Here’s the backtrace:

#0  0x00a807a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00154825 in raise () from /lib/tls/libc.so.6
#2  0x00156289 in abort () from /lib/tls/libc.so.6
#3  0x0014dda1 in __assert_fail () from /lib/tls/libc.so.6
#4  0x00d65268 in totemsrp_callback_token_destroy () from /usr/lib/libtotem_pg.so.4
#5  0x00d6a3ef in main_deliver_fn () from /usr/lib/libtotem_pg.so.4
#6  0x00d5cea8 in totemudpu_member_remove () from /usr/lib/libtotem_pg.so.4
#7  0x00d5e582 in rrp_deliver_fn () from /usr/lib/libtotem_pg.so.4
#8  0x00000000 in ?? ()

From looking at the source code, it appears that the assertion fails while corosync is trying to delete an item from a linked list.

There are several other things going on as well.

The logs show a continuous stream of messages saying that a processor has joined or left the membership. There are tens of thousands of these lines:

Mar 04 10:35:14 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 04 10:35:14 corosync [CPG  ] chosen downlist: sender r(0) ip(10.22.163.116) ; members(old:5 left:0)
Mar 04 10:35:14 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Mar 04 10:35:14 corosync [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 2208116: memb=5, new=0, lost=0
Mar 04 10:35:14 corosync [pcmk  ] info: pcmk_peer_update: memb: intersp2-admin1 1956845066
Mar 04 10:35:14 corosync [pcmk  ] info: pcmk_peer_update: memb: intersp2-engine1 1990399498
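
To put a number on the membership churn, something like the following could be run over the log. This is only a rough sketch: the function name is made up for illustration, and the regex assumes the exact timestamp and "[TOTEM ]" layout shown in the excerpt above, so it may need adjusting for other logging configurations.

```python
import re
from collections import Counter

# Matches the TOTEM membership-change line as it appears in the excerpt above.
# The captured group is the timestamp truncated to the minute ("Mar 04 10:35").
MEMBERSHIP_RE = re.compile(
    r"^(?P<minute>\w{3} \d{2} \d{2}:\d{2}):\d{2} corosync \[TOTEM\s*\] "
    r"A processor joined or left the membership"
)


def membership_changes_per_minute(lines):
    """Count membership-change messages, keyed by 'Mon DD HH:MM'.

    `lines` is any iterable of log lines (an open file works).
    The name and return shape are illustrative, not part of corosync.
    """
    counts = Counter()
    for line in lines:
        match = MEMBERSHIP_RE.match(line)
        if match:
            counts[match.group("minute")] += 1
    return counts
```

Passing an open log file to membership_changes_per_minute() and printing the result should show whether the membership changes are constant or arrive in bursts around the time of each coredump.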

The coredumps show up in the logs as these lines:

Mar 04 10:35:14 intersp2-db2 crmd: [24013]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
Mar 04 10:35:14 intersp2-db2 attrd: [24011]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
Mar 04 10:35:14 intersp2-db2 cib: [24009]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
Mar 04 10:35:14 intersp2-db2 crmd: [24013]: ERROR: ais_dispatch: AIS connection failed
Mar 04 10:35:14 intersp2-db2 attrd: [24011]: ERROR: ais_dispatch: AIS connection failed
Mar 04 10:35:14 intersp2-db2 cib: [24009]: ERROR: ais_dispatch: AIS connection failed
Mar 04 10:35:14 intersp2-db2 crmd: [24013]: ERROR: crm_ais_destroy: AIS connection terminated
Mar 04 10:35:14 intersp2-db2 attrd: [24011]: CRIT: attrd_ais_destroy: Lost connection to OpenAIS service!
Mar 04 10:35:14 intersp2-db2 cib: [24009]: ERROR: cib_ais_destroy: AIS connection terminated
Mar 04 10:35:14 intersp2-db2 attrd: [24011]: info: main: Exiting...
Mar 04 10:35:14 intersp2-db2 attrd: [24011]: ERROR: attrd_cib_connection_destroy: Connection to the CIB terminated...
Mar 04 10:35:14 intersp2-db2 stonithd: [24008]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Success (0)
Mar 04 10:35:14 intersp2-db2 stonithd: [24008]: ERROR: ais_dispatch: AIS connection failed
Mar 04 10:35:14 intersp2-db2 stonithd: [24008]: ERROR: AIS connection terminated
Mar 04 10:35:21 corosync [MAIN  ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.

On one of the boxes that had coredumped several times, /dev/shm was completely full with files named dispatch*, response*, request*, control*.
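
For anyone wanting to check the same thing on their own nodes, here is a minimal sketch that tallies the files in /dev/shm by name prefix and reports how full the filesystem is. The prefixes are just the ones I saw (dispatch, response, request, control); the function names are hypothetical, and the sizes are apparent file sizes from stat, which may not match the actual tmpfs usage exactly.

```python
import os
from collections import defaultdict

# Prefixes taken from the file names observed in /dev/shm on the affected box.
PREFIXES = ("dispatch", "response", "request", "control")


def shm_usage_by_prefix(directory="/dev/shm", prefixes=PREFIXES):
    """Return {prefix: (file_count, total_apparent_bytes)} for `directory`.

    Only regular files whose names start with one of `prefixes` are counted;
    prefixes with no matching files are left out of the result.
    """
    totals = defaultdict(lambda: [0, 0])
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            for prefix in prefixes:
                if entry.name.startswith(prefix):
                    totals[prefix][0] += 1
                    totals[prefix][1] += entry.stat(follow_symlinks=False).st_size
                    break
    return {prefix: tuple(value) for prefix, value in totals.items()}


def filesystem_fill_ratio(directory="/dev/shm"):
    """Fraction of the filesystem's blocks currently in use (0.0 to 1.0)."""
    stats = os.statvfs(directory)
    if stats.f_blocks == 0:
        return 0.0
    return 1.0 - stats.f_bfree / stats.f_blocks
```

Running shm_usage_by_prefix() before and after a corosync restart should show whether the leftover files keep accumulating with each crash.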

I also found that one box that should have been in the cluster was not able to join it. I'm not sure why.

All 5 of the boxes that were in the cluster coredumped at least once, with the same stack trace as above.

Any ideas as to what is going on here?
-gm

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss