On Monday 03 March 2008 12:23:36 gordan@xxxxxxxxxx wrote:
> Hi,
>
> I appear to be experiencing a strange compound problem with this that
> is proving rather difficult to troubleshoot, so I'm hoping someone here
> can spot a problem I haven't.
>
> I have a 2-node cluster with Open Shared Root on GFS on DRBD. A single
> node mounts GFS OK and works, but after a while seems to just block for
> disk, very much as if it started trying to fence the other node and is
> waiting for acknowledgement. There are no fence devices defined (so this
> could be a possibility), but the other node was never powered up in the
> first place, so it is somewhat beyond me why it might suddenly decide to
> try to fence it. This usually happens after a period of idleness. If the
> node is used, this doesn't seem to happen, but leaving it alone for half
> an hour causes it to block for disk I/O.

As I cannot help you too much with the DRBD problems, here is at least
some info to help you debug them ;-).

Regarding OSR being stuck (manual fencing): you should try using the
fenceacksv. As far as I can see from your configuration, it is already
set up:

  <clusternode name="node1" nodeid="1" votes="1">
      <com_info>
          <rootsource name="drbd"/>
          <!--<chrootenv mountpoint="/var/comoonics/chroot"
                         fstype="ext3"
                         device="/dev/sda2"
                         chrootdir="/var/comoonics/chroot"/>-->
          <syslog name="localhost"/>
          <rootvolume name="/dev/drbd1"
                      mountopts="noatime,nodiratime,noquota"/>
          <eth name="eth0" ip="10.0.0.1" mac="xxx"
               mask="255.0.0.0" gateway=""/>
          <fenceackserver user="root" passwd="password"/>
      </com_info>
  </clusternode>

Now you can telnet to the hung node on port 12242, log in, and you should
see immediately whether it is in a manual fencing state or not.

If you also install comoonics-fenceacksv-plugins-py, you will be able to
trigger sysrqs via the fenceacksv.
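For example, from another machine (the hostname is just illustrative, and
the exact login dialogue depends on your fenceacksv version):

  # connect to the fenceacksv on the hung node
  telnet sentinel1c 12242
  # authenticate with the credentials from the <fenceackserver/> tag,
  # i.e. user="root" and the passwd configured in your cluster.conf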
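Dumping the task states is then a good way to see what the node is
blocking on. As a rough sketch of what that boils down to, here is the
plain local kernel interface (my assumption: the plugin remotely triggers
the same sysrqs you could fire yourself via /proc, if you can still get a
shell):

  # make sure the magic sysrq key is enabled
  echo 1 > /proc/sys/kernel/sysrq
  # dump all tasks and their kernel stacks to the kernel log;
  # look for processes stuck in D state waiting on GFS/DRBD I/O
  echo t > /proc/sysrq-trigger
  dmesg | tail -n 100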
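As for the dual-mount problem below: as a general sanity check, GFS should
only be mounted on the second node once DRBD reports Connected and
UpToDate on both sides. With DRBD 8 you can verify that along these lines
(the resource name r0 is just an example; use the one from your
drbd.conf):

  cat /proc/drbd       # look for cs:Connected ... ds:UpToDate/UpToDate
  drbdadm cstate r0    # should print "Connected"
  drbdadm dstate r0    # should print "UpToDate/UpToDate"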
Hope that helps with debugging.

Marc.

> Unfortunately, it doesn't end there. When an attempt is made to
> dual-mount the GFS file system before the secondary is fully up to date
> (but is connected and syncing), the 2nd node to join notices an
> inconsistency and withdraws from the cluster. In the process, GFS gets
> corrupted, and the only way to get it to mount again on either node is
> to repair it with fsck.
>
> I'm not sure if this is a problem with my cluster setup or not, but I
> cannot see that the nodes would fail to find each other and get DLM
> working. Console logs seem to indicate that everything is in fact OK,
> and the nodes are connected directly via a cross-over cable.
>
> If the nodes are in sync by the time GFS tries to mount, the mount
> succeeds, but everything grinds to a halt shortly afterwards - so much
> so that the only way to get things moving again is to hard-reset one of
> the nodes, preferably the 2nd one to join.
>
> Here is where the second thing that seems wrong happened - the first
> node doesn't just lock up at this point, as one might expect (when a
> connected node disappears, e.g. due to a hard reset, the cluster is
> supposed to try to fence it until it cleanly rejoins - and it can't
> possibly fence the other node since I haven't configured any fencing
> devices yet). This doesn't seem to happen. The first node seems to
> continue like nothing happened. This is possibly connected to the fact
> that by this point, GFS is corrupted and has to be fsck-ed at next boot.
>
> This part may be a cluster setup issue, so I'll raise that on the
> cluster list, although it seems to be a DRBD-specific peculiarity -
> using a SAN doesn't have this issue with a nearly identical cluster.conf
> (the only difference being the block device specification).
>
> The cluster.conf is as follows:
>
> <?xml version="1.0"?>
> <cluster config_version="18" name="sentinel">
>     <cman two_node="1" expected_votes="1"/>
>     <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>     <clusternodes>
>         <clusternode name="sentinel1c" nodeid="1" votes="1">
>             <com_info>
>                 <rootsource name="drbd"/>
>                 <!--<chrootenv mountpoint="/var/comoonics/chroot"
>                                fstype="ext3"
>                                device="/dev/sda2"
>                                chrootdir="/var/comoonics/chroot"/>-->
>                 <syslog name="localhost"/>
>                 <rootvolume name="/dev/drbd1"
>                             mountopts="noatime,nodiratime,noquota"/>
>                 <eth name="eth0"
>                      ip="10.0.0.1"
>                      mac="00:0B:DB:92:C5:E1"
>                      mask="255.255.255.0"
>                      gateway=""/>
>                 <fenceackserver user="root" passwd="secret"/>
>             </com_info>
>             <fence>
>                 <method name="1"/>
>             </fence>
>         </clusternode>
>         <clusternode name="sentinel2c" nodeid="2" votes="1">
>             <com_info>
>                 <rootsource name="drbd"/>
>                 <!--<chrootenv mountpoint="/var/comoonics/chroot"
>                                fstype="ext3"
>                                device="/dev/sda2"
>                                chrootdir="/var/comoonics/chroot"/>-->
>                 <syslog name="localhost"/>
>                 <rootvolume name="/dev/drbd1"
>                             mountopts="noatime,nodiratime,noquota"/>
>                 <eth name="eth0"
>                      ip="10.0.0.2"
>                      mac="00:0B:DB:90:4E:1B"
>                      mask="255.255.255.0"
>                      gateway=""/>
>                 <fenceackserver user="root" passwd="secret"/>
>             </com_info>
>             <fence>
>                 <method name="1"/>
>             </fence>
>         </clusternode>
>     </clusternodes>
>     <cman/>
>     <fencedevices/>
>     <rm>
>         <failoverdomains/>
>         <resources/>
>     </rm>
> </cluster>
>
> Getting to the logs can be a bit difficult with OSR (they get reset on
> reboot, and it's rather difficult to get to them when the node stops
> responding without rebooting it), so I don't have those at the moment.
>
> Any suggestions would be welcome at this point.
>
> TIA.
>
> Gordan

--
Gruss / Regards,

Marc Grimme
Phone: +49-89 452 3538-14
http://www.atix.de/
http://www.open-sharedroot.org/

**
ATIX Informationstechnologie und Consulting AG
Einsteinstr. 10
85716 Unterschleissheim
Deutschland/Germany

Phone: +49-89 452 3538-0
Fax: +49-89 990 1766-0

Registergericht: Amtsgericht Muenchen
Registernummer: HRB 168930
USt.-Id.: DE209485962
Vorstand: Marc Grimme, Mark Hlawatschek, Thomas Merz (Vors.)
Vorsitzender des Aufsichtsrats: Dr. Martin Buss

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster