Hi,
I appear to be experiencing a strange compound problem with this setup that
is proving rather difficult to troubleshoot, so I'm hoping someone here
can spot something I haven't.
I have a 2-node cluster with Open Shared Root on GFS on DRBD. A single
node mounts GFS OK and works, but after a while it seems to just block on
disk I/O. Very much as if it had started trying to fence the other node and
is waiting for acknowledgement. There are no fence devices defined (so this
could be a possibility), but the other node was never powered up in the
first place, so it is somewhat beyond me why it might suddenly decide to
try to fence it. This usually happens after a period of idleness. If the
node is being used, this doesn't seem to happen, but leaving it alone for
half an hour causes it to block on disk I/O.
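For reference, something along these lines run from a second console while
the stall is happening should show whether the node really is stuck on a
fence/DLM operation (a rough sketch, assuming the RHEL5-era cman/groupd
userland):

    cman_tool status     # quorum / membership summary
    group_tool ls        # fence, dlm and gfs group state - "wait" states would be telling
    dmesg | tail -n 50   # any GFS/DLM/fencing messages from the kernel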
Unfortunately, it doesn't end there. When an attempt is made to dual-mount
the GFS file system before the secondary is fully up to date (but is
connected and syncing), the 2nd node to join notices an inconsistency, and
withdraws from the cluster. In the process, GFS gets corrupted, and the
only way to get it to mount again on either node is to repair it with
fsck.
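The kind of check I'd expect to need before allowing the second mount is
something along these lines (a sketch; "r1" stands in for whatever resource
backs /dev/drbd1 in drbd.conf):

    drbdadm cstate r1   # should report "Connected", not "SyncSource"/"SyncTarget"
    drbdadm dstate r1   # should report "UpToDate/UpToDate"
    cat /proc/drbd      # full picture, including sync progress

Though as described below, even waiting for the sync to finish doesn't save me.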
I'm not sure if this is a problem with my cluster setup or not, but I
cannot see why the nodes would fail to find each other and get DLM
working. The console logs seem to indicate that everything is in fact OK,
and the nodes are connected directly via a cross-over cable.
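The basic membership check I have in mind is no more than this (a sketch;
node names and addresses as in the cluster.conf below):

    cman_tool nodes      # both sentinel1c and sentinel2c should show up as members
    ping -c 3 10.0.0.2   # from sentinel1c, across the cross-over link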
If the nodes are in sync by the time GFS tries to mount, the mount succeeds,
but everything grinds to a halt shortly afterwards - so much so that the only
way to get things moving again is to hard-reset one of the nodes, preferably
the 2nd one to join.
Here is where the second thing that seems wrong happens: the first node
doesn't lock up at this point, as one might expect. When a connected node
disappears (e.g. due to a hard reset), the cluster is supposed to keep
trying to fence it until it cleanly rejoins, and since I haven't configured
any fencing devices yet, that can't possibly succeed - so I'd expect the
first node to block. That doesn't happen; the first node just carries on as
if nothing had happened. This is possibly connected to the fact that by
this point GFS is corrupted and has to be fsck-ed at the next boot. This
part may be a cluster setup issue, so I'll raise it on the cluster list,
although it seems to be a DRBD-specific peculiarity - a nearly identical
cluster.conf on a SAN (the only difference being the block device
specification) doesn't have this issue.
The cluster.conf is as follows:
<?xml version="1.0"?>
<cluster config_version="18" name="sentinel">
  <cman two_node="1" expected_votes="1"/>
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="sentinel1c" nodeid="1" votes="1">
      <com_info>
        <rootsource name="drbd"/>
        <!--
        <chrootenv mountpoint="/var/comoonics/chroot"
                   fstype="ext3"
                   device="/dev/sda2"
                   chrootdir="/var/comoonics/chroot"/>
        -->
        <syslog name="localhost"/>
        <rootvolume name="/dev/drbd1"
                    mountopts="noatime,nodiratime,noquota"/>
        <eth name="eth0"
             ip="10.0.0.1"
             mac="00:0B:DB:92:C5:E1"
             mask="255.255.255.0"
             gateway=""/>
        <fenceackserver user="root"
                        passwd="secret"/>
      </com_info>
      <fence>
        <method name="1"/>
      </fence>
    </clusternode>
    <clusternode name="sentinel2c" nodeid="2" votes="1">
      <com_info>
        <rootsource name="drbd"/>
        <!--
        <chrootenv mountpoint="/var/comoonics/chroot"
                   fstype="ext3"
                   device="/dev/sda2"
                   chrootdir="/var/comoonics/chroot"/>
        -->
        <syslog name="localhost"/>
        <rootvolume name="/dev/drbd1"
                    mountopts="noatime,nodiratime,noquota"/>
        <eth name="eth0"
             ip="10.0.0.2"
             mac="00:0B:DB:90:4E:1B"
             mask="255.255.255.0"
             gateway=""/>
        <fenceackserver user="root"
                        passwd="secret"/>
      </com_info>
      <fence>
        <method name="1"/>
      </fence>
    </clusternode>
  </clusternodes>
  <cman/>
  <fencedevices/>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>
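If the missing fencing turns out to be part of the problem, I guess the
minimal thing to try is manual fencing - something along these lines (an
untested sketch from memory, not my actual config):

  <fencedevices>
    <fencedevice agent="fence_manual" name="manual"/>
  </fencedevices>

with each node's empty <method name="1"/> filled in accordingly, e.g.:

  <fence>
    <method name="1">
      <device name="manual" nodename="sentinel1c"/>
    </method>
  </fence>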
Getting at the logs can be a bit difficult with OSR (they get reset on
reboot, and when a node stops responding it's rather difficult to get at
them without rebooting it), so I don't have those at the moment.
Any suggestions would be welcome at this point.
TIA.
Gordan