On Thu, Feb 2, 2012 at 6:24 AM, Grant Martin (granmart) <granmart@xxxxxxxxx> wrote:
> Hi,
> We have a 6 box cluster running with corosync 1.4.1-1. Each box supports a
> "maintenance mode" where it is isolated from the other boxes by using the
> firewall to block its communications.
>
> When we put a box in maintenance mode, we get these messages in
> corosync.log:
>
> Jan 08 06:08:07 corosync [TOTEM ] Totem is unable to form a cluster because
> of an operating system or network fault. The most common cause of this
> message is that the local firewall is configured improperly.
>
> as well as messages like these:
>
> Jan 08 06:08:08 corosync [pcmk ] ERROR: send_cluster_msg_raw: Child 10942
> spawned to record non-fatal assertion failure line 1591: rc == 0
> Jan 08 06:08:08 corosync [pcmk ] ERROR: send_cluster_msg_raw: Message not
> sent (-1): <create_request_adv origin="do_election_vote" t="crmd"
> version="3.0.1" subt="request" reference="vote-crmd-1326002888-12
> Jan 08 06:08:08 corosync [pcmk ] WARN: route_ais_message: Sending message
> to <all>.crmd failed: cluster delivery failed (rc=-1)
> Jan 08 06:08:08 corosync [pcmk ] ERROR: send_cluster_msg_raw: Child 10943
> spawned to record non-fatal assertion failure line 1591: rc == 0
> Jan 08 06:08:08 corosync [pcmk ] ERROR: send_cluster_msg_raw: Message not
> sent (-1): <create_request_adv origin="join_make_offer" t="crmd"
> version="3.0.1" subt="request" reference="join_offer-dc-1326002888
> Jan 08 06:08:08 corosync [pcmk ] WARN: route_ais_message: Sending message
> to agile4-ctx1-db1.crmd failed: cluster delivery failed (rc=-1)
>
> Note the messages from send_cluster_msg_raw. Every so often we get a
> coredump.
> Here is the stack:
>
> #0 0x00ee87a2 in _dl_sysinfo_int80 ()
> from /lib/ld-linux.so.2
> (gdb) bt
> #0 0x00ee87a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> #1 0x00138825 in raise () from /lib/tls/libc.so.6
> #2 0x0013a289 in abort () from /lib/tls/libc.so.6
> #3 0x005f296b in send_cluster_msg_raw () from
> /usr/libexec/lcrso/pacemaker.lcrso
> #4 0x005f2510 in route_ais_message () from
> /usr/libexec/lcrso/pacemaker.lcrso
> #5 0x005f0759 in pcmk_ipc () from /usr/libexec/lcrso/pacemaker.lcrso
> #6 0x00c7b269 in coroipcs_response_iov_send () from
> /usr/lib/libcoroipcs.so.4
> #7 0x00b163cc in start_thread () from /lib/tls/libpthread.so.0
> #8 0x001dcf0e in clone () from /lib/tls/libc.so.6
>
> as the stack shows, send_cluster_msg_raw() is calling abort.
>
> I looked at the code for send_cluster_msg_raw in the pacemaker code
> (plugin.c). AIS_ASSERT calls abort, so it looks like one of these 2 lines
> is aborting:
>
> AIS_ASSERT(local_nodeid != 0);
> AIS_ASSERT(ais_msg->header.size == (sizeof(AIS_Message) +
> ais_data_len(ais_msg)));

Actually you're hitting:

    AIS_CHECK(rc == 0, ais_err("Message not sent (%d): %.120s", rc, mutable->data));

Which is line 1591 of the file containing send_cluster_msg_raw, hence:

    "Child 10942 spawned to record non-fatal assertion failure line 1591: rc == 0"

Very strange, that means that the call:

    rc = pcmk_api->totem_mcast(&iovec, 1, TOTEMPG_SAFE);

failed. Steve might have some more ideas as to why that would happen.

> We considered stopping corosync while in maintenance mode, but one of our
> nodes will shutdown if corosync is not running, so that is not an option for
> us.
>
> Is there a way to keep corosync running but not doing anything? Of course
> when we leave maintenance mode, we want corosync to start sending messages
> again. Any other ideas on how to handle this?

In this case, not loading the pacemaker side of things would be enough.
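
Going back to the AIS_CHECK point: the "Child NNN spawned to record non-fatal assertion failure" wording suggests a fork-and-abort pattern, where a forked child calls abort() to leave a core file while the parent logs the failure and keeps running. If that is what is happening here, the occasional coredumps would come from those children and not from corosync itself going down. Below is a rough sketch of that general pattern only; it is not the plugin's actual macro, and record_nonfatal_failure(), SOFT_CHECK() and send_message() are hypothetical names made up for illustration.

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not Pacemaker code): record a failed check without
 * taking down the calling process. The forked child calls abort(), which
 * may leave a core file depending on "ulimit -c"; the parent only logs.
 * Returns the child's pid, or -1 if fork() failed.
 */
static pid_t record_nonfatal_failure(const char *expr, int line)
{
    pid_t pid;

    fflush(NULL);               /* do not duplicate buffered output in the child */
    pid = fork();
    if (pid == 0) {
        abort();                /* child: die with SIGABRT so a core can be written */
    }
    if (pid > 0) {
        fprintf(stderr,
                "Child %d spawned to record non-fatal assertion failure line %d: %s\n",
                (int)pid, line, expr);
    }
    return pid;
}

/* Hypothetical soft assertion: on failure, record it and run failure_action. */
#define SOFT_CHECK(expr, failure_action)                  \
    do {                                                  \
        if (!(expr)) {                                    \
            record_nonfatal_failure(#expr, __LINE__);     \
            failure_action;                               \
        }                                                 \
    } while (0)

/* Hypothetical stand-in for a send path; rc plays the role of the multicast result. */
static int send_message(int rc)
{
    SOFT_CHECK(rc == 0, fprintf(stderr, "Message not sent (%d)\n", rc));
    return rc;                  /* the caller still sees the error and carries on */
}
```

Calling send_message(-1) in this sketch logs both lines and returns -1 to the caller, while the child exits on SIGABRT. A long-running daemon using something like this would also need to reap the children (waitpid() or a SIGCHLD handler), which the sketch leaves to the caller.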
> -gm
>
>
> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss