Re: problem with rejoining a node

Javi Polo wrote:
> Hi there (again :P)
> 
> I'm still fighting with all this, sorry to bother you so much (I hope that
> some day, when I understand it all better, I'll write an article on how to
> set this up).
> 
> Well, I already have the cluster up and the GFS filesystem mounted on 3
> machines, and if one of those goes down, it's correctly fenced. The FC
> port is also disconnected, so I suppose everything is OK at this point.
> 
> The problem is on the recovery side. I understand that when a node rejoins
> it is automatically unfenced, and then it can rejoin the fence domain and
> mount the filesystem again.
> 
> I've blocked all input and output traffic with iptables on the node I
> want to test.
> 
> The node gets fenced ok:
> Aug  8 16:00:48 gfstest2 fenced[2594]: fencing node "gfstest1"
> Aug  8 16:00:56 gfstest2 fenced[2594]: fence "gfstest1" success

What sort of fencing are you using? If it's a power-switch fence then the
node should be hard rebooted. If it's SAN fencing then you'll have to get the
node out of the cluster yourself - the remaining two nodes /should/ tell it
to leave the cluster.
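
For reference, the two kinds of agent behave quite differently when run by
hand. A rough sketch (the switch names, logins and port numbers here are
made up, and the exact flags vary between agent versions, so check your man
pages first):

   # power-switch fencing: power-cycles the victim, which then reboots clean
   fence_apc -a apc-switch -l admin -p apc -n 1 -o reboot

   # SAN fencing: disables the victim's FC switch port, leaving the node
   # running but cut off from storage
   fence_brocade -a fc-switch -l admin -p brocade -n 4 -o disable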

A node can't just "rejoin" a cluster after being SAN fenced. It must be removed
from the cluster and rejoin from scratch. There's far too much state involved
for it to merge seamlessly back into a cluster.
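
Once the FC port has been re-enabled and the node rebooted, a clean rejoin
looks roughly like this (a sketch for the cman/GFS stack you're running; the
mount point is made up, and the init scripts normally do all of this for
you):

   # on gfstest1, after a clean boot with storage access restored
   ccsd                                    # start the config daemon
   cman_tool join                          # rejoin the cluster
   fence_tool join                         # rejoin the fence domain
   mount -t gfs /dev/sdc1 /mnt/primer_fs   # remount the filesystem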

> Now I can access the GFS filesystem safely from my other 2 nodes, as the
> FC port for gfstest1 is disabled now, but if I re-enable traffic for the
> node, it does not rejoin the cluster. Shouldn't this happen automatically?
> 
> Anyway, I cannot rejoin/leave/whatever the cluster from gfstest1:
> gfstest1:~# cman_tool services
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           1   2 run       -
> [1 2 3]
> 
> DLM Lock Space:  "primer_fs"                         2   3 run       -
> [1 2 3]
> 
> GFS Mount Group: "primer_fs"                         3   4 run       -
> [1 2 3]
> 
> gfstest1:~# cman_tool join
> cman_tool: Node is already active
> gfstest1:~# cman_tool leave
> cman_tool: Can't leave cluster while there are 5 active subsystems

cman_tool leave force will force it to leave, but you might find it still needs
a reboot to clear the filesystems.
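
In practice that means something like this on the fenced node (reboot -f
skips the init scripts, which would otherwise hang trying to unmount GFS):

   cman_tool leave force   # tear down membership despite active subsystems
   reboot -f               # hard reboot without the normal shutdown sequence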

> And also, I cannot umount /dev/sdc1 as I have no access to the SAN
> (and anyway DLM would block it from doing so). So I get a totally
> screwed-up system that I can only fix by hard-rebooting (if I do a
> clean reboot, the system "hangs" while "unmounting filesystems").
> 
> Also, when the system boots up, the SAN is still inaccessible, as the
> fencing script does not run to re-enable the port ...
> 
> I'm lost diving into Google queries ... and it's certainly hard to
> find accurate info about all this :/
> 
> Could someone shed some light?
> (I probably don't understand well how the fencing system works, but I
> also haven't found anywhere it's explained :/)
> 
> thx in advance :)
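
As far as I know there's nothing automatic about re-enabling a SAN-fenced
port - nothing on the fenced node itself can be trusted to do it, so it's
left to the administrator (or to tooling on the surviving nodes). You can do
it by hand by running the fence agent with the opposite action. A made-up
example (switch details are hypothetical, and flag names vary between agent
versions, so check the man page):

   fence_brocade -a fc-switch -l admin -p brocade -n 4 -o enable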


-- 

patrick

--

Linux-cluster@xxxxxxxxxx
http://www.redhat.com/mailman/listinfo/linux-cluster
