On 02/01/2012 12:24 PM, Grant Martin (granmart) wrote:
> Hi,
> We have a 6 box cluster running with corosync 1.4.1-1. Each box
> supports a "maintenance mode" where it is isolated from the other
> boxes by using the firewall to block its communications.
>
> When we put a box in maintenance mode, we get these messages in
> corosync.log:
>
> Jan 08 06:08:07 corosync [TOTEM ] Totem is unable to form a cluster
> because of an operating system or network fault. The most common cause
> of this message is that the local firewall is configured improperly.
>
> as well as messages like these:
>
> Jan 08 06:08:08 corosync [pcmk ] ERROR: send_cluster_msg_raw: Child
> 10942 spawned to record non-fatal assertion failure line 1591: rc == 0
> Jan 08 06:08:08 corosync [pcmk ] ERROR: send_cluster_msg_raw: Message
> not sent (-1): <create_request_adv origin="do_election_vote" t="crmd"
> version="3.0.1" subt="request" reference="vote-crmd-1326002888-12
> Jan 08 06:08:08 corosync [pcmk ] WARN: route_ais_message: Sending
> message to <all>.crmd failed: cluster delivery failed (rc=-1)
> Jan 08 06:08:08 corosync [pcmk ] ERROR: send_cluster_msg_raw: Child
> 10943 spawned to record non-fatal assertion failure line 1591: rc == 0
> Jan 08 06:08:08 corosync [pcmk ] ERROR: send_cluster_msg_raw: Message
> not sent (-1): <create_request_adv origin="join_make_offer" t="crmd"
> version="3.0.1" subt="request" reference="join_offer-dc-1326002888
> Jan 08 06:08:08 corosync [pcmk ] WARN: route_ais_message: Sending
> message to agile4-ctx1-db1.crmd failed: cluster delivery failed (rc=-1)
>
> Note the messages from send_cluster_msg_raw. Every so often we get a
> coredump.
> Here is the stack:
>
> #0 0x00ee87a2 in _dl_sysinfo_int80 ()
> from /lib/ld-linux.so.2
> (gdb) bt
> #0 0x00ee87a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> #1 0x00138825 in raise () from /lib/tls/libc.so.6
> #2 0x0013a289 in abort () from /lib/tls/libc.so.6
> #3 0x005f296b in send_cluster_msg_raw () from
> /usr/libexec/lcrso/pacemaker.lcrso
> #4 0x005f2510 in route_ais_message () from
> /usr/libexec/lcrso/pacemaker.lcrso
> #5 0x005f0759 in pcmk_ipc () from /usr/libexec/lcrso/pacemaker.lcrso
> #6 0x00c7b269 in coroipcs_response_iov_send () from
> /usr/lib/libcoroipcs.so.4
> #7 0x00b163cc in start_thread () from /lib/tls/libpthread.so.0
> #8 0x001dcf0e in clone () from /lib/tls/libc.so.6
>
> As the stack shows, send_cluster_msg_raw() is calling abort.
>

Could you please install pacemaker-debuginfo and corosync-debuginfo and
generate the backtrace again?

> I looked at the code for send_cluster_msg_raw in the pacemaker code
> (plugin.c). AIS_ASSERT calls abort, so it looks like one of these 2
> lines is aborting:
>
> AIS_ASSERT(local_nodeid != 0);
> AIS_ASSERT(ais_msg->header.size == (sizeof(AIS_Message) +
> ais_data_len(ais_msg)));
>
> We considered stopping corosync while in maintenance mode, but one of
> our nodes will shut down if corosync is not running, so that is not an
> option for us.
>
> Is there a way to keep corosync running but not doing anything? Of
> course when we leave maintenance mode, we want corosync to start sending
> messages again. Any other ideas on how to handle this?
> -gm
>
> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss
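
For readers following the thread: the "Child ... spawned to record non-fatal
assertion failure" log lines and the raise()/abort() frames in the quoted
backtrace are consistent with two kinds of check, one that forks a child which
calls abort() so the parent keeps running, and one that aborts in-process.
Below is a minimal C sketch of that general pattern. It is not pacemaker's
code; report_nonfatal, CHECK_NONFATAL and CHECK_FATAL are hypothetical names
introduced here purely for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (an assumption, not taken from plugin.c).
 * If the check failed, fork a child that calls abort() so the failure
 * can be recorded (for example as a core file, where the system allows
 * it) while the parent process keeps running.
 * Returns 0 when the check passed, -1 when it failed. */
static int report_nonfatal(int ok, const char *expr, int line)
{
    pid_t pid;

    assert(expr != NULL);
    if (ok) {
        return 0;
    }

    pid = fork();
    if (pid == 0) {
        /* Child: abort() raises SIGABRT, which gives the raise()/abort()
         * pair at the top of a backtrace. */
        abort();
    }
    if (pid > 0) {
        fprintf(stderr,
                "Child %d spawned to record non-fatal assertion failure "
                "line %d: %s\n", (int)pid, line, expr);
        waitpid(pid, NULL, 0); /* reap the child so it does not linger */
    }
    return -1;
}

/* Non-fatal check: the caller continues and can inspect the result. */
#define CHECK_NONFATAL(expr) report_nonfatal((expr) != 0, #expr, __LINE__)

/* Fatal check: the calling process itself aborts. */
#define CHECK_FATAL(expr)                                              \
    do {                                                               \
        if (!(expr)) {                                                 \
            fprintf(stderr, "Assertion failure line %d: %s\n",         \
                    __LINE__, #expr);                                  \
            abort();                                                   \
        }                                                              \
    } while (0)
```

With this sketch, CHECK_NONFATAL(rc == 0) returns -1 and logs a line when rc is
non-zero, while CHECK_FATAL(local_nodeid != 0) would terminate the caller.
Which of the two produced a given core is easier to tell from a backtrace
taken with debug symbols installed, as requested above.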