Re: corosync 1.4.1-1 coredump

Andrew Beekhof <andrew@xxxxxxxxxxx> · Thu, 2 Feb 2012 12:39:01 +1100

On Thu, Feb 2, 2012 at 6:24 AM, Grant Martin (granmart)
<granmart@xxxxxxxxx> wrote:
> Hi,
> We have a 6-box cluster running with corosync 1.4.1-1.  Each box supports a
> "maintenance mode" where it is isolated from the other boxes by using the
> firewall to block its communications.
>
> When we put a box in maintenance mode, we get these messages in
> corosync.log:
>
> Jan 08 06:08:07 corosync [TOTEM ] Totem is unable to form a cluster because
> of an operating system or network fault. The most common cause of this
> message is that the local firewall is configured improperly.
>
> as well as messages like these:
>
> Jan 08 06:08:08 corosync [pcmk  ] ERROR: send_cluster_msg_raw: Child 10942
> spawned to record non-fatal assertion failure line 1591: rc == 0
> Jan 08 06:08:08 corosync [pcmk  ] ERROR: send_cluster_msg_raw: Message not
> sent (-1): <create_request_adv origin="do_election_vote" t="crmd"
> version="3.0.1" subt="request" reference="vote-crmd-1326002888-12
> Jan 08 06:08:08 corosync [pcmk  ] WARN: route_ais_message: Sending message
> to <all>.crmd failed: cluster delivery failed (rc=-1)
> Jan 08 06:08:08 corosync [pcmk  ] ERROR: send_cluster_msg_raw: Child 10943
> spawned to record non-fatal assertion failure line 1591: rc == 0
> Jan 08 06:08:08 corosync [pcmk  ] ERROR: send_cluster_msg_raw: Message not
> sent (-1): <create_request_adv origin="join_make_offer" t="crmd"
> version="3.0.1" subt="request" reference="join_offer-dc-1326002888
> Jan 08 06:08:08 corosync [pcmk  ] WARN: route_ais_message: Sending message
> to agile4-ctx1-db1.crmd failed: cluster delivery failed (rc=-1)
>
> Note the messages from send_cluster_msg_raw.  Every so often we get a
> coredump.  Here is the stack:
>
> #0  0x00ee87a2 in _dl_sysinfo_int80 ()
>    from /lib/ld-linux.so.2
> (gdb) bt
> #0  0x00ee87a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> #1  0x00138825 in raise () from /lib/tls/libc.so.6
> #2  0x0013a289 in abort () from /lib/tls/libc.so.6
> #3  0x005f296b in send_cluster_msg_raw () from
> /usr/libexec/lcrso/pacemaker.lcrso
> #4  0x005f2510 in route_ais_message () from
> /usr/libexec/lcrso/pacemaker.lcrso
> #5  0x005f0759 in pcmk_ipc () from /usr/libexec/lcrso/pacemaker.lcrso
> #6  0x00c7b269 in coroipcs_response_iov_send () from
> /usr/lib/libcoroipcs.so.4
> #7  0x00b163cc in start_thread () from /lib/tls/libpthread.so.0
> #8  0x001dcf0e in clone () from /lib/tls/libc.so.6
>
> As the stack shows, send_cluster_msg_raw() is calling abort().
>
> I looked at the code for send_cluster_msg_raw in the pacemaker code
> (plugin.c).  AIS_ASSERT calls abort, so it looks like one of these 2 lines
> is aborting:
>
>     AIS_ASSERT(local_nodeid != 0);
>     AIS_ASSERT(ais_msg->header.size ==
>                (sizeof(AIS_Message) + ais_data_len(ais_msg)));

Actually you're hitting:

    AIS_CHECK(rc == 0,
              ais_err("Message not sent (%d): %.120s", rc, mutable->data));

That is line 1591 of the file containing send_cluster_msg_raw, hence:
   "Child 10942 spawned to record non-fatal assertion failure line 1591: rc == 0"

Very strange; that means the call:
    rc = pcmk_api->totem_mcast(&iovec, 1, TOTEMPG_SAFE);
failed.
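
To illustrate what "Child ... spawned to record non-fatal assertion failure" suggests, here is a minimal C sketch of a fork-and-abort check: the process forks, the child calls abort() so that a core file can be written, and the parent logs the failure and carries on. This is not Pacemaker's actual AIS_CHECK implementation; record_nonfatal_failure(), SKETCH_CHECK, fake_mcast() and send_sketch() are hypothetical names invented for the example, and fake_mcast() merely stands in for a multicast call that returns non-zero. If AIS_CHECK works along these lines, the core files described above would come from the short-lived child rather than from corosync itself, but that is an assumption.

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not Pacemaker code): fork a child whose only job is
 * to abort(), so a core file (if enabled) captures the state at the point
 * of failure while the parent keeps running.
 * Returns the signal that ended the child, 0 if the child exited normally,
 * or -1 if fork() or waitpid() failed. */
static int record_nonfatal_failure(const char *file, int line, const char *expr)
{
    int status = 0;
    pid_t pid;

    fflush(NULL);               /* do not duplicate buffered output in the child */
    pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        abort();                /* child: terminated by SIGABRT */
    }

    fprintf(stderr,
            "Child %d spawned to record non-fatal assertion failure %s:%d: %s\n",
            (int)pid, file, line, expr);

    /* The sketch reaps the child synchronously to keep things simple. */
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Hypothetical macro: on failure, record it and run failure_action,
 * then continue instead of terminating the caller. */
#define SKETCH_CHECK(expr, failure_action)                          \
    do {                                                            \
        if (!(expr)) {                                              \
            record_nonfatal_failure(__FILE__, __LINE__, #expr);     \
            failure_action;                                         \
        }                                                           \
    } while (0)

/* Stand-in for a multicast call that fails. */
static int fake_mcast(void)
{
    return -1;
}

/* Mirrors the shape of the check discussed above: the send fails, the
 * failure is recorded, and the caller still gets the return code back. */
static int send_sketch(void)
{
    int rc = fake_mcast();

    SKETCH_CHECK(rc == 0, fprintf(stderr, "Message not sent (%d)\n", rc));
    return rc;
}
```

Calling send_sketch() from a main() logs both messages and returns -1 to the caller; whether a core file actually appears depends on the system's core-dump limits.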

Steve might have some more ideas as to why that would happen.

> We considered stopping corosync while in maintenance mode, but one of our
> nodes will shut down if corosync is not running, so that is not an option for
> us.
>
> Is there a way to keep corosync running but not doing anything?  Of course
> when we leave maintenance mode, we want corosync to start sending messages
> again.  Any other ideas on how to handle this?

In this case, running corosync without loading the pacemaker plugin would
be enough.
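
For reference, in a corosync 1.x plugin setup the pacemaker plugin is normally loaded by a service block, either in corosync.conf or in a file under /etc/corosync/service.d/. Below is a sketch of leaving it out, assuming this cluster uses the stock block from the Pacemaker documentation. The actual file and "ver" value on these boxes are not known from this thread.

```
# Commented out so that corosync starts without the pacemaker plugin:
# service {
#     # Load the Pacemaker Cluster Resource Manager
#     name: pacemaker
#     ver:  0
# }
```

Service plugins are loaded when corosync starts, so a change like this should only take effect after a corosync restart.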

> -gm
>
>
> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss
>