Re: corosync 1.4.1-1 coredump

On Thu, Feb 2, 2012 at 2:55 PM, Steven Dake <sdake@xxxxxxxxxx> wrote:
> On 02/01/2012 06:39 PM, Andrew Beekhof wrote:
>> On Thu, Feb 2, 2012 at 6:24 AM, Grant Martin (granmart)
>> <granmart@xxxxxxxxx> wrote:
>>> Hi,
>>> We have a 6-box cluster running with corosync 1.4.1-1.  Each box supports a
>>> "maintenance mode" where it is isolated from the other boxes by using the
>>> firewall to block its communications.
>>>
>>> When we put a box in maintenance mode, we get these messages in
>>> corosync.log:
>>>
>>> Jan 08 06:08:07 corosync [TOTEM ] Totem is unable to form a cluster because
>>> of an operating system or network fault. The most common cause of this
>>> message is that the local firewall is configured improperly.
>>>
>>> as well as messages like these:
>>>
>>> Jan 08 06:08:08 corosync [pcmk  ] ERROR: send_cluster_msg_raw: Child 10942
>>> spawned to record non-fatal assertion failure line 1591: rc == 0
>>> Jan 08 06:08:08 corosync [pcmk  ] ERROR: send_cluster_msg_raw: Message not
>>> sent (-1): <create_request_adv origin="do_election_vote" t="crmd"
>>> version="3.0.1" subt="request" reference="vote-crmd-1326002888-12
>>> Jan 08 06:08:08 corosync [pcmk  ] WARN: route_ais_message: Sending message
>>> to <all>.crmd failed: cluster delivery failed (rc=-1)
>>> Jan 08 06:08:08 corosync [pcmk  ] ERROR: send_cluster_msg_raw: Child 10943
>>> spawned to record non-fatal assertion failure line 1591: rc == 0
>>> Jan 08 06:08:08 corosync [pcmk  ] ERROR: send_cluster_msg_raw: Message not
>>> sent (-1): <create_request_adv origin="join_make_offer" t="crmd"
>>> version="3.0.1" subt="request" reference="join_offer-dc-1326002888
>>> Jan 08 06:08:08 corosync [pcmk  ] WARN: route_ais_message: Sending message
>>> to agile4-ctx1-db1.crmd failed: cluster delivery failed (rc=-1)
>>>
>>> Note the messages from send_cluster_msg_raw.  Every so often we get a
>>> coredump.  Here is the stack:
>>>
>>> #0  0x00ee87a2 in _dl_sysinfo_int80 ()
>>>    from /lib/ld-linux.so.2
>>> (gdb) bt
>>> #0  0x00ee87a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
>>> #1  0x00138825 in raise () from /lib/tls/libc.so.6
>>> #2  0x0013a289 in abort () from /lib/tls/libc.so.6
>>> #3  0x005f296b in send_cluster_msg_raw () from
>>> /usr/libexec/lcrso/pacemaker.lcrso
>>> #4  0x005f2510 in route_ais_message () from
>>> /usr/libexec/lcrso/pacemaker.lcrso
>>> #5  0x005f0759 in pcmk_ipc () from /usr/libexec/lcrso/pacemaker.lcrso
>>> #6  0x00c7b269 in coroipcs_response_iov_send () from
>>> /usr/lib/libcoroipcs.so.4
>>> #7  0x00b163cc in start_thread () from /lib/tls/libpthread.so.0
>>> #8  0x001dcf0e in clone () from /lib/tls/libc.so.6
>>>
>>> as the stack shows, send_cluster_msg_raw() is calling abort.
>>>
>>> I looked at the code for send_cluster_msg_raw in the pacemaker code
>>> (plugin.c).  AIS_ASSERT calls abort, so it looks like one of these 2 lines
>>> is aborting:
>>>
>>>     AIS_ASSERT(local_nodeid != 0);
>>>     AIS_ASSERT(ais_msg->header.size == (sizeof(AIS_Message) +
>>> ais_data_len(ais_msg)));
>>
>> Actually you're hitting:
>>
>>     AIS_CHECK(rc == 0, ais_err("Message not sent (%d): %.120s", rc,
>> mutable->data));
>>
>> Which is line 1591 of the file containing send_cluster_msg_raw, hence:
>>    "Child 10942 spawned to record non-fatal assertion failure line
>> 1591: rc == 0"
>>
>> Very strange, that means that the call:
>>     rc = pcmk_api->totem_mcast(&iovec, 1, TOTEMPG_SAFE);
>> failed.
>>
>> Steve might have some more ideas as to why that would happen.
>>
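
(For context on the "Child ... spawned to record non-fatal assertion failure"
wording: unlike AIS_ASSERT, AIS_CHECK is deliberately non-fatal -- it logs the
failure and then forks a child whose only job is to abort so that a core file
gets written, while the plugin keeps running.  The occasional core files you
see are most likely those children.  Roughly, the idea is as below -- this is
a paraphrase, not the literal macros from plugin.c:)

    /* Paraphrased sketch of the two macros -- not the literal definitions. */
    #include <stdlib.h>
    #include <unistd.h>

    /* Fatal: the plugin itself aborts. */
    #define AIS_ASSERT(expr) \
        do { if (!(expr)) { abort(); } } while (0)

    /* Non-fatal: log, then fork a child that aborts purely to leave a
     * core file behind; the parent (the plugin) carries on. */
    #define AIS_CHECK(expr, failure_action)               \
        do {                                              \
            if (!(expr)) {                                \
                failure_action;   /* e.g. ais_err(...) */ \
                if (fork() == 0) {                        \
                    abort();      /* child dumps core */  \
                }                                         \
            }                                             \
        } while (0)
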
>
> totem_mcast fails when the new message queue is full.  This would happen
> if the protocol was blocked for long periods of time while messages were
> continually added (for example, iptables was enabled).  IPC requests
> block (and return ERR_TRY_AGAIN) to avoid this problem, but this
> typically doesn't happen in service engines because there is a 1:1
> mapping between IPC requests and totem messages sent.
>
> Back to the original problem, is there any way for pacemaker to handle a
> full message queue in this condition?

Yes. The node will get fenced.

To the rest of the cluster it appears offline.
There is no such thing as a healthy node that you can't communicate with.

>
> Regards
> -steve
>
>>> We considered stopping corosync while in maintenance mode, but one of our
>>> nodes will shut down if corosync is not running, so that is not an option
>>> for us.
>>>
>>> Is there a way to keep corosync running but not doing anything?  Of course
>>> when we leave maintenance mode, we want corosync to start sending messages
>>> again.  Any other ideas on how to handle this?
>>
>> In this case, not loading the pacemaker side of things would be enough.
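
(Concretely: keep corosync running, but comment out the service stanza that
loads the pacemaker plugin while the box is in maintenance mode, then restart
corosync.  The exact location varies by distribution -- it may live in
corosync.conf itself or in a drop-in file such as /etc/corosync/service.d/pcmk:)

    # Commenting this out keeps corosync itself running but stops the
    # pacemaker plugin from being loaded, so nothing is sent on its behalf.
    #service {
    #        name: pacemaker
    #        ver:  0
    #}
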
>>
>>> -gm
>>>
>>>
>
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss


