Re: Node is randomly fenced




On 17/06/14 15:27, Schaefer, Micah wrote:
I am running Red Hat 6.4 with the HA/ load balancing packages from the
install DVD.


-bash-4.1$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.4 (Santiago)

-bash-4.1$ corosync -v
Corosync Cluster Engine, version '1.4.1'
Copyright (c) 2006-2009 Red Hat, Inc.




Thanks. 6.5 has better pause detection in it but I don't think that's the issue here actually. It looks to me like some messages are getting through but not others. So I'm back to seriously wondering if multicast traffic is being forwarded correctly and reliably. Having a mix of virtual and physical systems can cause these sorts of issues with real and software switches being mixed. Though I haven't seen anything quite as odd as this to be honest.
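
(As a side note, not from the original thread: one quick way to sanity-check multicast delivery between the nodes is omping, assuming the package is installed on all of them; the hostnames below are placeholders for the real node names.)

-bash-4.1$ omping -c 30 node1 node2 node3 node4

Run the same command on every node at roughly the same time. If the unicast replies come back clean but the multicast ones show loss or never arrive on some nodes, that points at the switch or forwarding path rather than at corosync itself.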

Can you try either UDPU (preferred) or broadcast transport please and see if that helps or changes the symptoms at all? Broadcast could be problematic itself with the real/virtual mix so UDPU will be a more reliable option.

Annoyingly, you'll need to take down the whole cluster to do this, and add

<cman transport="udpu"/>

to /etc/cluster/cluster.conf on all nodes.
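
For anyone following along, here is a rough sketch of where that line sits in a minimal cluster.conf. The cluster name, node names and nodeids are placeholders, not taken from this thread; the existing clusternodes and fencing sections should be left exactly as they are, and config_version needs to be bumped whenever the file changes:

<?xml version="1.0"?>
<cluster name="examplecluster" config_version="2">
  <!-- switch corosync from multicast to unicast UDP -->
  <cman transport="udpu"/>
  <clusternodes>
    <clusternode name="node1" nodeid="1"/>
    <clusternode name="node2" nodeid="2"/>
  </clusternodes>
</cluster>

If the file already contains a <cman> element (for expected_votes, two_node, etc.), then presumably the transport="udpu" attribute goes on that existing element rather than on a second one.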

Chrissie




On 6/17/14, 8:41 AM, "Christine Caulfield" <ccaulfie@xxxxxxxxxx> wrote:

On 12/06/14 20:06, Digimer wrote:
Hrm, I'm not really sure that I am able to interpret this without making
guesses. I'm cc'ing one of the devs (who I hope will poke the right
person if he's not able to help at the moment). Let's see what he has to
say.

I am curious now, too. :)

On 12/06/14 03:02 PM, Schaefer, Micah wrote:
Node4 was fenced again. I was able to get some debug logs (below) and a new message:

"Jun 12 14:01:56 corosync [TOTEM ] The token was lost in the OPERATIONAL state."


Rest of corosync logs

http://pastebin.com/iYFbkbhb


Jun 12 14:44:49 corosync [TOTEM ] entering OPERATIONAL state.
Jun 12 14:44:49 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 12 14:44:49 corosync [TOTEM ] waiting_trans_ack changed to 0
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] entering GATHER state from 12.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 32947 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33016 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33086 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages.
Jun 12 14:44:49 corosync [TOTEM ] Process pause detected for 33155 ms, flushing membership messages.
Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, flushing membership messages.
Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33224 ms, flushing membership messages.
Jun 12 14:44:50 corosync [TOTEM ] Process pause detected for 33225 ms, flushing membership messages.


I'm concerned that the pause messages are repeating like that; it looks like it might be a bug that has already been fixed. What version of corosync do you have?

Chrissie




--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster




