José,
Fencing is mandatory for GFS, not optional. Once the failed node is
detected, the remaining cluster nodes will *wait* until it has been
successfully fenced. Once it is fenced (i.e. power-cycled or disconnected
from the SAN), one of the cluster nodes replays the failed node's journal
and GFS operation continues. Without fencing, the cluster will hang
on any lock held by the failed node (which is what your hanging
systems are showing).
Install a proper fencing agent for operational use. For testing purposes
you could use manual fencing (i.e. the fence_manual agent).
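As a rough sketch of what a manual-fencing setup for testing might look
like, assuming the cluster.conf layout used by this generation of the Red
Hat cluster tools: the device name "human" and the method name "single"
are arbitrary labels picked for illustration, and config_version is bumped
on the assumption that this replaces your version 1. Please check it
against the documentation shipped with your cman/fence packages before
relying on it.

```xml
<?xml version="1.0"?>
<cluster name="correo" config_version="2">
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="node1" votes="1">
      <fence>
        <!-- "single" is an illustrative method name -->
        <method name="single">
          <device name="human" nodename="node1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2" votes="1">
      <fence>
        <method name="single">
          <device name="human" nodename="node2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <!-- "human" is an illustrative device name; the agent is fence_manual -->
    <fencedevice name="human" agent="fence_manual"/>
  </fencedevices>
</cluster>
```

With something along these lines, fenced should ask for manual
intervention when a node fails. Only after you have actually power-cycled
that node do you acknowledge the fence (fence_ack_manual -n node1, if I
recall the syntax correctly), and the surviving node then carries on.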
PS: Plugging the cable back in without power-cycling is a NO-GO. The
failed node is no longer in sync with the rest of the cluster (the other
nodes assume the machine has been power-cycled after a manual fence). You
risk GFS filesystem corruption by attaching it back to the storage
without proper fencing procedures!
Jeroen
José Miguel Parrella Romero wrote:
Greetings,
I've been trying to set up a two-node cluster using a shared SAN (via
Fibre Channel) and GFS. I've previously tried OCFS2, and I don't want to
use NFS yet. The cluster must be an active-active one, and it runs on
Itanium2 machines with Debian 4.0. I'm using cman 1.03.00.
I've set up a cluster using Red Hat tools, and my
/etc/cluster/cluster.conf looks like:
-- my cluster.conf --
<?xml version="1.0"?>
<cluster name="correo" config_version="1">
  <cman two_node="1" expected_votes="1">
  </cman>
  <clusternodes>
    <clusternode name="node1" votes="1">
    </clusternode>
    <clusternode name="node2" votes="1">
    </clusternode>
  </clusternodes>
</cluster>
-- end my cluster.conf --
Note that I've removed entries related to fencing, but I previously had
a 'manual' fencing method. So I've an LVM volume which contains a GFS
filesystem, and I'm able to start ccsd, cman, fenced, clvmd and all the
other related applications.
Syslog reports that the cluster is quorate, and I'm able to mount the
filesystem on both of my nodes. They need to write to the shared storage
in an active-active fashion.
I expect that removing the network cable in node1 would do the following:
a) node1 would be disabled (right, it doesn't have a network cable)
b) node2 would notice node1 is not there and will keep writing to the
shared storage
c) Eventually node1 will come back, and node2 will notice it, so it will
hopefully start writing again
And this is what happens when I unplug the network cable:
a) node1 is disabled (no connectivity)
b) node2 is also disabled! (trying to write to /home and /var/mail
stalls the machine, and then logins and other processes are stalled)
c) Plugging the cable back in does nothing (both machines are hung by
now, so I need to reboot them)
I'm probably missing something, since this solution using OCFS2 also has
the same problem! Our last-resort solution is active-active NFS using
Heartbeat, but then we wouldn't be writing to the SAN through FC (2Gbps)
but through Ethernet (1Gbps) since we don't have any other media around ATM.
Is this a configuration related problem? Or is this a design feature in
both GFS/OCFS2? Or maybe I'm just missing the whole picture...
Thank you very much for any advice,
Jose
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster