Re: Node is randomly fenced




On 17/06/14 15:27, Schaefer, Micah wrote:
I am running Red Hat 6.4 with the HA/ load balancing packages from the
install DVD.


-bash-4.1$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.4 (Santiago)

-bash-4.1$ corosync -v
Corosync Cluster Engine, version '1.4.1'
Copyright (c) 2006-2009 Red Hat, Inc.




Thanks. 6.5 has better pause detection in it but I don't think that's the issue here actually. It looks to me like some messages are getting through but not others. So I'm back to seriously wondering if multicast traffic is being forwarded correctly and reliably. Having a mix of virtual and physical systems can cause these sorts of issues with real and software switches being mixed. Though I haven't seen anything quite as odd as this to be honest.
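
(As a side note, not from the original thread: one quick way to sanity-check multicast delivery between the nodes is omping, assuming the package is installed on all of them; the hostnames below are placeholders for the real node names.)

-bash-4.1$ omping -c 30 node1 node2 node3 node4

Run the same command on every node at roughly the same time. If the unicast replies come back clean but the multicast ones show loss or never arrive on some nodes, that points at the switch or forwarding path rather than at corosync itself.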

Can you try either UDPU (preferred) or broadcast transport please and see if that helps or changes the symptoms at all? Broadcast could be problematic itself with the real/virtual mix so UDPU will be a more reliable option.

Annoyingly, you'll need to take down the whole cluster to do this, and add

<cman transport="udpu"/>

to /etc/cluster/cluster.conf on all nodes.
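
For anyone following along, here is a rough sketch of where that line sits in a minimal cluster.conf. The cluster name, node names and nodeids are placeholders, not taken from this thread; the existing clusternodes and fencing sections should be left exactly as they are, and config_version needs to be bumped whenever the file changes:

<?xml version="1.0"?>
<cluster name="examplecluster" config_version="2">
  <!-- switch corosync from multicast to unicast UDP -->
  <cman transport="udpu"/>
  <clusternodes>
    <clusternode name="node1" nodeid="1"/>
    <clusternode name="node2" nodeid="2"/>
  </clusternodes>
</cluster>

If the file already contains a <cman> element (for expected_votes, two_node, etc.), then presumably the transport="udpu" attribute goes on that existing element rather than on a second one.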

Chrissie




On 6/17/14, 8:41 AM, "Christine Caulfield" <ccaulfie@xxxxxxxxxx> wrote:

On 12/06/14 20:06, Digimer wrote:
Hrm, I'm not really sure that I am able to interpret this without making
guesses. I'm cc'ing one of the devs (who I hope will poke the right
person if he's not able to help at the moment). Let's see what he has to
say.

I am curious now, too. :)

On 12/06/14 03:02 PM, Schaefer, Micah wrote:
Node4 was fenced again. I was able to get some debug logs (below) and a new message:

"Jun 12 14:01:56 corosync [TOTEM ] The token was lost in the OPERATIONAL state."


Rest of corosync logs

http://pastebin.com/iYFbkbhb


Jun 12 14:44:49 corosync [TOTEM ] entering OPERATIONAL state.
Jun 12 14:44:49 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 12 14:44:49 corosync [TOTEM ] waiting_trans_ack changed to 0
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] entering GATHER state from 12.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages.
Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, flushing membership messages.
Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, flushing membership messages.
Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms, flushing membership messages.


I'm concerned that the pause messages are repeating like that; it looks like it might be a bug that has already been fixed. What version of corosync do you have?

Chrissie




--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster




