On Thu, Feb 2, 2012 at 2:55 PM, Steven Dake <sdake@xxxxxxxxxx> wrote:
> On 02/01/2012 06:39 PM, Andrew Beekhof wrote:
>> On Thu, Feb 2, 2012 at 6:24 AM, Grant Martin (granmart) <granmart@xxxxxxxxx> wrote:
>>> Hi,
>>> We have a 6-box cluster running corosync 1.4.1-1. Each box supports a
>>> "maintenance mode" in which it is isolated from the other boxes by using
>>> the firewall to block its communications.
>>>
>>> When we put a box in maintenance mode, we get these messages in corosync.log:
>>>
>>> Jan 08 06:08:07 corosync [TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
>>>
>>> as well as messages like these:
>>>
>>> Jan 08 06:08:08 corosync [pcmk ] ERROR: send_cluster_msg_raw: Child 10942 spawned to record non-fatal assertion failure line 1591: rc == 0
>>> Jan 08 06:08:08 corosync [pcmk ] ERROR: send_cluster_msg_raw: Message not sent (-1): <create_request_adv origin="do_election_vote" t="crmd" version="3.0.1" subt="request" reference="vote-crmd-1326002888-12
>>> Jan 08 06:08:08 corosync [pcmk ] WARN: route_ais_message: Sending message to <all>.crmd failed: cluster delivery failed (rc=-1)
>>> Jan 08 06:08:08 corosync [pcmk ] ERROR: send_cluster_msg_raw: Child 10943 spawned to record non-fatal assertion failure line 1591: rc == 0
>>> Jan 08 06:08:08 corosync [pcmk ] ERROR: send_cluster_msg_raw: Message not sent (-1): <create_request_adv origin="join_make_offer" t="crmd" version="3.0.1" subt="request" reference="join_offer-dc-1326002888
>>> Jan 08 06:08:08 corosync [pcmk ] WARN: route_ais_message: Sending message to agile4-ctx1-db1.crmd failed: cluster delivery failed (rc=-1)
>>>
>>> Note the messages from send_cluster_msg_raw. Every so often we also get a core dump. Here is the stack:
>>>
>>> #0  0x00ee87a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
>>> (gdb) bt
>>> #0  0x00ee87a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
>>> #1  0x00138825 in raise () from /lib/tls/libc.so.6
>>> #2  0x0013a289 in abort () from /lib/tls/libc.so.6
>>> #3  0x005f296b in send_cluster_msg_raw () from /usr/libexec/lcrso/pacemaker.lcrso
>>> #4  0x005f2510 in route_ais_message () from /usr/libexec/lcrso/pacemaker.lcrso
>>> #5  0x005f0759 in pcmk_ipc () from /usr/libexec/lcrso/pacemaker.lcrso
>>> #6  0x00c7b269 in coroipcs_response_iov_send () from /usr/lib/libcoroipcs.so.4
>>> #7  0x00b163cc in start_thread () from /lib/tls/libpthread.so.0
>>> #8  0x001dcf0e in clone () from /lib/tls/libc.so.6
>>>
>>> As the stack shows, send_cluster_msg_raw() is calling abort().
>>>
>>> I looked at the code for send_cluster_msg_raw in the pacemaker source
>>> (plugin.c). AIS_ASSERT calls abort, so it looks like one of these two lines
>>> is aborting:
>>>
>>> AIS_ASSERT(local_nodeid != 0);
>>> AIS_ASSERT(ais_msg->header.size == (sizeof(AIS_Message) + ais_data_len(ais_msg)));
>>
>> Actually you're hitting:
>>
>> AIS_CHECK(rc == 0, ais_err("Message not sent (%d): %.120s", rc, mutable->data));
>>
>> which is line 1591 of the file containing send_cluster_msg_raw, hence:
>> "Child 10942 spawned to record non-fatal assertion failure line 1591: rc == 0"
>>
>> Very strange; that means that the call
>>
>>     rc = pcmk_api->totem_mcast(&iovec, 1, TOTEMPG_SAFE);
>>
>> failed.
>>
>> Steve might have some more ideas as to why that would happen.
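[For context, a minimal, self-contained sketch of the kind of non-fatal check being discussed: verify the totem_mcast return code, log the failure, and hand the error back to the caller instead of aborting. The struct, the TOTEMPG_SAFE value, and log_error() below are stand-ins, not pacemaker's actual plugin.c code.]

    #include <stdio.h>
    #include <stddef.h>
    #include <sys/uio.h>

    #define TOTEMPG_SAFE 1  /* stand-in for corosync's real guarantee constant */

    /* Stand-in for the slice of the corosync plugin API used here. */
    struct totem_api {
        int (*totem_mcast)(const struct iovec *iov, unsigned int iov_len,
                           unsigned int guarantee);
    };

    static void log_error(const char *msg, int rc)
    {
        fprintf(stderr, "ERROR: %s (rc=%d)\n", msg, rc);
    }

    /* Multicast one message; on failure, record a non-fatal error and
     * return the error code to the caller rather than calling abort(). */
    static int send_cluster_msg_sketch(struct totem_api *api,
                                       void *buf, size_t len)
    {
        struct iovec iovec = { .iov_base = buf, .iov_len = len };
        int rc = api->totem_mcast(&iovec, 1, TOTEMPG_SAFE);

        if (rc != 0) {
            log_error("Message not sent", rc);
        }
        return rc;
    }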
> totem_mcast fails when the new message queue is full. This would happen
> if the protocol was blocked for long periods of time while messages were
> continually added (for example, if iptables was enabled). IPC requests
> block (and return ERR_TRY_AGAIN) to avoid this problem, but this
> typically doesn't happen in service engines because there is a 1:1
> mapping between IPC requests and totem messages sent.
>
> Back to the original problem, is there any way for pacemaker to handle a
> full message queue in this condition?

Yes: the node will get fenced. To the rest of the cluster it appears
offline. There is no such thing as a healthy node that you can't
communicate with.

> Regards
> -steve
>
>>> We considered stopping corosync while in maintenance mode, but one of
>>> our nodes will shut down if corosync is not running, so that is not an
>>> option for us.
>>>
>>> Is there a way to keep corosync running but not doing anything? Of
>>> course, when we leave maintenance mode, we want corosync to start
>>> sending messages again. Any other ideas on how to handle this?
>>
>> In this case, not loading the pacemaker side of things would be enough.
>>
>>> -gm

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
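[On Andrew's suggestion above about not loading the pacemaker side of things: with the corosync 1.x plugin-based setup, pacemaker is typically pulled in by a service block like the one below, usually placed in /etc/corosync/corosync.conf or in a file under /etc/corosync/service.d/. Omitting or commenting out that block (the exact location and "ver" value vary by distribution and pacemaker version) leaves corosync running without loading the pacemaker plugin.]

    service {
        # Load the Pacemaker Cluster Resource Manager plugin
        name: pacemaker
        ver:  0
    }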