corosync 1.4.1-1 coredump

Hi,
We have a 6-box cluster running corosync 1.4.1-1.  Each box supports a "maintenance mode" in which it is isolated from the other boxes by using the firewall to block its communications.
 
When we put a box in maintenance mode, we get these messages in corosync.log:
 
Jan 08 06:08:07 corosync [TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
 
as well as messages like these:
 
Jan 08 06:08:08 corosync [pcmk  ] ERROR: send_cluster_msg_raw: Child 10942 spawned to record non-fatal assertion failure line 1591: rc == 0
Jan 08 06:08:08 corosync [pcmk  ] ERROR: send_cluster_msg_raw: Message not sent (-1): <create_request_adv origin="do_election_vote" t="crmd" version="3.0.1" subt="request" reference="vote-crmd-1326002888-12
Jan 08 06:08:08 corosync [pcmk  ] WARN: route_ais_message: Sending message to <all>.crmd failed: cluster delivery failed (rc=-1)
Jan 08 06:08:08 corosync [pcmk  ] ERROR: send_cluster_msg_raw: Child 10943 spawned to record non-fatal assertion failure line 1591: rc == 0
Jan 08 06:08:08 corosync [pcmk  ] ERROR: send_cluster_msg_raw: Message not sent (-1): <create_request_adv origin="join_make_offer" t="crmd" version="3.0.1" subt="request" reference="join_offer-dc-1326002888
Jan 08 06:08:08 corosync [pcmk  ] WARN: route_ais_message: Sending message to agile4-ctx1-db1.crmd failed: cluster delivery failed (rc=-1)
 
Note the messages from send_cluster_msg_raw.  Every so often we get a coredump.  Here is the stack:
 
#0  0x00ee87a2 in _dl_sysinfo_int80 ()
   from /lib/ld-linux.so.2
(gdb) bt
#0  0x00ee87a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00138825 in raise () from /lib/tls/libc.so.6
#2  0x0013a289 in abort () from /lib/tls/libc.so.6
#3  0x005f296b in send_cluster_msg_raw () from /usr/libexec/lcrso/pacemaker.lcrso
#4  0x005f2510 in route_ais_message () from /usr/libexec/lcrso/pacemaker.lcrso
#5  0x005f0759 in pcmk_ipc () from /usr/libexec/lcrso/pacemaker.lcrso
#6  0x00c7b269 in coroipcs_response_iov_send () from /usr/lib/libcoroipcs.so.4
#7  0x00b163cc in start_thread () from /lib/tls/libpthread.so.0
#8  0x001dcf0e in clone () from /lib/tls/libc.so.6
 
As the stack shows, send_cluster_msg_raw() is calling abort().
 
I looked at the code for send_cluster_msg_raw() in the pacemaker source (plugin.c).  AIS_ASSERT calls abort(), so it looks like one of these two lines is aborting:
 
    AIS_ASSERT(local_nodeid != 0);
    AIS_ASSERT(ais_msg->header.size == (sizeof(AIS_Message) + ais_data_len(ais_msg)));
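To make the two failure styles seen above concrete, here is a minimal sketch in C. It is not Pacemaker's code: SKETCH_ASSERT and record_nonfatal_failure are hypothetical names used only for illustration. The macro aborts the calling process, which is the behavior the backtrace suggests. The helper forks a child that aborts so a core can be recorded while the parent keeps running, which is only a guess at what the "Child ... spawned to record non-fatal assertion failure" log lines describe.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical fatal assertion (not the real AIS_ASSERT): report the
 * failed expression, then abort() the calling process. abort() raises
 * SIGABRT, which normally produces a core dump if core files are enabled. */
#define SKETCH_ASSERT(expr)                                              \
    do {                                                                 \
        if (!(expr)) {                                                   \
            fprintf(stderr, "Assertion failure line %d: %s\n",           \
                    __LINE__, #expr);                                    \
            abort();                                                     \
        }                                                                \
    } while (0)

/* Hypothetical non-fatal variant: fork a child whose only job is to
 * abort (and so leave a core), while the parent continues.
 * Returns the signal that terminated the child, or -1 on error. */
static int record_nonfatal_failure(const char *expr, int line)
{
    fflush(NULL);               /* do not duplicate buffered output in the child */

    pid_t pid = fork();
    if (pid < 0) {
        return -1;              /* fork failed */
    }
    if (pid == 0) {
        abort();                /* child: die with SIGABRT */
    }

    fprintf(stderr,
            "Child %d spawned to record non-fatal assertion failure line %d: %s\n",
            (int)pid, line, expr);

    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        return -1;              /* could not reap the child */
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

In this sketch, calling record_nonfatal_failure("rc == 0", 1591) leaves the caller alive and returns SIGABRT, whereas SKETCH_ASSERT(local_nodeid != 0) with a zero local_nodeid would take down the whole process, as in the backtrace.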
We considered stopping corosync while in maintenance mode, but one of our nodes will shut down if corosync is not running, so that is not an option for us.
 
Is there a way to keep corosync running without it trying to send cluster messages?  Of course, when we leave maintenance mode, we want corosync to start sending messages again.  Any other ideas on how to handle this?
-gm
 
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss