Re: corosync 1.4.1-1 coredump

Steven Dake <sdake@xxxxxxxxxx> · Wed, 01 Feb 2012 15:31:03 -0700

On 02/01/2012 12:24 PM, Grant Martin (granmart) wrote:
> Hi,
> We have a 6 box cluster running with corosync 1.4.1-1.  Each box
> supports a "maintenance mode" where it is isolated from the other
> boxes by using the firewall to block its communications.
>  
> When we put a box in maintenance mode, we get these messages in
> corosync.log:
>  
> Jan 08 06:08:07 corosync [TOTEM ] Totem is unable to form a cluster
> because of an operating system or network fault. The most common cause
> of this message is that the local firewall is configured improperly.
>  
> as well as messages like these:
>  
> Jan 08 06:08:08 corosync [pcmk  ] ERROR: send_cluster_msg_raw: Child
> 10942 spawned to record non-fatal assertion failure line 1591: rc == 0
> Jan 08 06:08:08 corosync [pcmk  ] ERROR: send_cluster_msg_raw: Message
> not sent (-1): <create_request_adv origin="do_election_vote" t="crmd"
> version="3.0.1" subt="request" reference="vote-crmd-1326002888-12
> Jan 08 06:08:08 corosync [pcmk  ] WARN: route_ais_message: Sending
> message to <all>.crmd failed: cluster delivery failed (rc=-1)
> Jan 08 06:08:08 corosync [pcmk  ] ERROR: send_cluster_msg_raw: Child
> 10943 spawned to record non-fatal assertion failure line 1591: rc == 0
> Jan 08 06:08:08 corosync [pcmk  ] ERROR: send_cluster_msg_raw: Message
> not sent (-1): <create_request_adv origin="join_make_offer" t="crmd"
> version="3.0.1" subt="request" reference="join_offer-dc-1326002888
> Jan 08 06:08:08 corosync [pcmk  ] WARN: route_ais_message: Sending
> message to agile4-ctx1-db1.crmd failed: cluster delivery failed (rc=-1)
>  
> Note the messages from send_cluster_msg_raw.  Every so often we get a
> coredump.  Here is the stack:
>  
> #0  0x00ee87a2 in _dl_sysinfo_int80 ()
>    from /lib/ld-linux.so.2
> (gdb) bt
> #0  0x00ee87a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> #1  0x00138825 in raise () from /lib/tls/libc.so.6
> #2  0x0013a289 in abort () from /lib/tls/libc.so.6
> #3  0x005f296b in send_cluster_msg_raw () from
> /usr/libexec/lcrso/pacemaker.lcrso
> #4  0x005f2510 in route_ais_message () from
> /usr/libexec/lcrso/pacemaker.lcrso
> #5  0x005f0759 in pcmk_ipc () from /usr/libexec/lcrso/pacemaker.lcrso
> #6  0x00c7b269 in coroipcs_response_iov_send () from
> /usr/lib/libcoroipcs.so.4
> #7  0x00b163cc in start_thread () from /lib/tls/libpthread.so.0
> #8  0x001dcf0e in clone () from /lib/tls/libc.so.6
>  
> As the stack shows, send_cluster_msg_raw() is calling abort().
>  

Could you please install pacemaker-debuginfo and corosync-debuginfo and
generate the backtrace again?
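
The "Child NNNN spawned to record non-fatal assertion failure" lines in
the log above suggest a fork-then-abort pattern, where a short-lived
child calls abort() to leave a core file while the parent keeps running.
If that is what is happening here, some of the cores may come from those
children rather than from corosync itself. Below is a minimal sketch of
that pattern. The name demo_check_nonfatal and the synchronous waitpid()
are assumptions made for illustration; this is not Pacemaker's actual
macro.

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not Pacemaker's macro: record a failed check by
 * letting a short-lived child abort while the caller keeps running.
 * Returns 0 if the check passed, 1 if a child aborted on our behalf,
 * and -1 if fork() or waitpid() failed or the child ended another way. */
static int demo_check_nonfatal(int ok, const char *expr, int line)
{
    pid_t pid;
    int status = 0;

    if (ok) {
        return 0;
    }

    fflush(NULL);               /* do not duplicate buffered output in the child */
    pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        abort();                /* child: dies with SIGABRT, may leave a core file */
    }

    fprintf(stderr,
            "Child %ld spawned to record non-fatal assertion failure line %d: %s\n",
            (long)pid, line, expr);

    /* The sketch reaps the child right away so no zombie is left behind. */
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return (WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT) ? 1 : -1;
}
```

With a pattern like this, the parent survives a failed "rc == 0" check,
and each failure can still produce a core whose stack ends in abort().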

> I looked at the code for send_cluster_msg_raw in the pacemaker code
> (plugin.c).  AIS_ASSERT calls abort, so it looks like one of these 2
> lines is aborting:
>  
>     AIS_ASSERT(local_nodeid != 0);
>     AIS_ASSERT(ais_msg->header.size == (sizeof(AIS_Message) +
> ais_data_len(ais_msg)));
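
For reference, the second assertion is a size-consistency check: the
total size recorded in the header has to equal the fixed struct size
plus the payload length. A rough sketch of that invariant follows. The
struct demo_msg and its fields are hypothetical stand-ins for
AIS_Message, and the sketch assumes ais_data_len() returns the
compressed length when the message is compressed, which should be
checked against the actual source.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for AIS_Message; the real layout lives in
 * Pacemaker's headers and has more fields. */
struct demo_msg {
    uint32_t header_size;      /* total bytes claimed by the header */
    int      is_compressed;    /* non-zero if the payload is compressed */
    uint32_t size;             /* uncompressed payload length */
    uint32_t compressed_size;  /* payload length when compressed */
};

/* Assumed behaviour of ais_data_len(): pick the length that matches
 * how the payload is actually stored. */
static uint32_t demo_data_len(const struct demo_msg *m)
{
    return m->is_compressed ? m->compressed_size : m->size;
}

/* The invariant the assertion enforces: header size == struct + payload. */
static int demo_size_ok(const struct demo_msg *m)
{
    return m->header_size == sizeof(struct demo_msg) + demo_data_len(m);
}
```

A message whose header size was computed from the uncompressed length
but whose payload is flagged as compressed would fail this check.
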
> We considered stopping corosync while in maintenance mode, but one of
> our nodes will shut down if corosync is not running, so that is not an
> option for us. 
>  
> Is there a way to keep corosync running but not doing anything?  Of
> course when we leave maintenance mode, we want corosync to start sending
> messages again.  Any other ideas on how to handle this?
> -gm
>  
> 
> 
> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss