RHEL5.3 / cman-2.0.98-1.el5 / Problem loop on "Node x is undead"

"Alain.Moulle" <Alain.Moulle@xxxxxxxx> · Wed, 25 Feb 2009 16:35:44 +0100

Alain.Moulle wrote:

> Hi,
> 
> I'm again facing this problem of "Node evicted" and "Node is undead",
> and I really don't know what to do. The syslog traces are below.
> My version is: RHEL5.3 / cman-2.0.98-1.el5
> 
> Feb 25 14:33:33 s_sys@xn3 qdiskd[27582]: <notice> Writing eviction
> notice for node 2
> Feb 25 14:33:34 s_sys@xn3 qdiskd[27582]: <notice> Node 2 evicted
> Feb 25 14:33:35 s_sys@xn3 qdiskd[27582]: <crit> Node 2 is undead.
> ... etc.
> Feb 25 14:33:45 s_sys@xn3 qdiskd[27582]: <crit> Node 2 is undead.
> Feb 25 14:33:45 s_sys@xn3 qdiskd[27582]: <alert> Writing eviction notice
> for node 2
> Feb 25 14:33:46 s_sys@xn3 qdiskd[27582]: <crit> Node 2 is undead.
> Feb 25 14:33:46 s_sys@xn3 qdiskd[27582]: <alert> Writing eviction notice
> for node 2
> Feb 25 14:33:47 s_kernel@xn3 kernel: dlm: closing connection to node 2
> Feb 25 14:33:47 s_sys@xn3 fenced[27785]: xn4 not a cluster member after
> 0 sec post_fail_delay
> Feb 25 14:33:47 s_sys@xn3 fenced[27785]: fencing node "xn4"
> Feb 25 14:33:47 s_sys@xn3 qdiskd[27582]: <crit> Node 2 is undead.
> ...etc.
> Feb 25 14:33:52 s_sys@xn3 qdiskd[27582]: <alert> Writing eviction notice
> for node 2
> Feb 25 14:33:52 s_sys@xn3 fenced[27785]: fence "xn4" success
> Feb 25 14:33:53 s_sys@xn3 qdiskd[27582]: <crit> Node 2 is undead.
> Feb 25 14:33:53 s_sys@xn3 qdiskd[27582]: <alert> Writing eviction notice
> for node 2
> Feb 25 14:33:54 s_sys@xn3 qdiskd[27582]: <crit> Node 2 is undead.
> Feb 25 14:33:54 s_sys@xn3 qdiskd[27582]: <alert> Writing eviction notice
> for node 2
> Feb 25 14:33:54 s_sys@xn3 clurgmgrd[27990]: <notice> Taking over service
> service:lustre_xn4 from down member xn4
> Feb 25 14:33:55 s_sys@xn3 qdiskd[27582]: <crit> Node 2 is undead.
> .. etc.
> 
> And then, after rebooting xn4, when we try to start the CS on xn4, it
> can't join the cluster, and we
> must stop the CS on both nodes and start it on both sides again.
> 
> Where could this problem come from? How can I avoid this eviction of
> the node?
> 
> Any help would be much appreciated.

You haven't posted any cman/openais messages but it's quite possible
you've hit this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=485026

There's a patch included and some links to fixed RPMs.

Chrissie

Thanks Chrissie, but I have checked this bugzilla and, unless I'm
misunderstanding, it seems to be more about the problem of starting a
second node too late relative to the start of the first node, so that
the second node can no longer join the cluster. But there are no
"Node is undead" messages in the syslog in that case (I've checked the
syslog attached to the bugzilla).

My problem occurs after a poweroff -f on one node of an HA pair with a
quorum disk, when both nodes are up and running their services. In this
case, powering off the second node makes the first one log the
"Node 2 evicted" and "Node 2 is undead" loop in syslog, and this starts
right after the poweroff, not when the second node tries to start the
CS again.
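
To make the size and timing of the loop easier to report, here is a minimal
sketch that counts the qdiskd messages per node id. It only assumes the line
format of the syslog excerpt quoted above (one message per line, before the
mail client wrapped it). The function name summarize_qdiskd and the category
labels are made up for illustration and are not part of cman or qdiskd.

```python
import re
from collections import Counter

# Matches qdiskd lines in the format of the quoted excerpt, e.g.
#   Feb 25 14:33:35 s_sys@xn3 qdiskd[27582]: <crit> Node 2 is undead.
QDISKD_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+qdiskd\[\d+\]:\s+"
    r"<(?P<level>\w+)>\s+(?P<msg>.*)$"
)
NODE_RE = re.compile(r"[Nn]ode (\d+)")


def summarize_qdiskd(lines):
    """Count qdiskd message kinds per node id.

    Returns a dict {node_id: Counter}. Lines from other daemons
    (fenced, clurgmgrd, kernel) are ignored.
    """
    summary = {}
    for line in lines:
        m = QDISKD_RE.match(line.strip())
        if not m:
            continue
        msg = m.group("msg")
        node = NODE_RE.search(msg)
        if not node:
            continue
        # Hypothetical labels, chosen only for this sketch.
        if "is undead" in msg:
            kind = "undead"
        elif "Writing eviction notice" in msg:
            kind = "eviction_notice"
        elif "evicted" in msg:
            kind = "evicted"
        else:
            kind = "other"
        summary.setdefault(int(node.group(1)), Counter())[kind] += 1
    return summary
```

It can be fed any iterable of lines, for example an open file handle on
whichever file your syslog daemon writes the qdiskd messages to.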

Regards,

Alain

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster