Hi,

On Mon, 2008-12-01 at 16:46 -0600, Brian Kroth wrote:
> Given the recent discussion of GFS2's stability I thought I'd chime in
> with a problem test case.
>
> I've noticed a deadlock in the following situation:
>
> 3 node Debian (Lenny) cluster of ESX-based VM nodes using either fibre
> channel or open-iscsi based storage. Version 2.03.06 of the
> redhat-cluster-suite software, 0.80.3 openais, and 2.6.26 on the kernel.
>
I'm not that familiar with the Debian kernel, so I don't know what fixes
might have been added recently. You might find that the problem goes
away if you upgrade to a more recent kernel, however...

> cssh node1 node2 node3
> cd /gfs2/
> mkdir $HOSTNAME
> echo $HOSTNAME > $HOSTNAME/test
> rm -rf *
>
> The last command generally deadlocks at least one of the machines. Any
> access attempts to the /gfs2 volume simply hang. No logs in dmesg,
> messages, etc. On a few occasions, about 24 hours later, it'll get
> fenced, but usually it's just stuck indefinitely. I haven't had a
> chance to look into this in much more depth since I had to get
> something running, so I just went back to OCFS2. I now have an
> opportunity to test things again, so if someone would like more
> information or could possibly tell me what's wrong, that would be nice.
>
> Thanks,
> Brian
>
The first thing to check is that you have debugfs mounted on each node.
You can then look at the glock dumps, which are located under
/sys/kernel/debug/gfs2/<fsname>/glocks.

There are a number of lines in this file, each relating to a particular
glock. Lines starting with G: describe a glock, and the lines below it,
indented by a single space, relate to that same glock. H: lines describe
the holders of that glock; if you look at the flags field, which starts
with f:, you can see whether any of the holders are waiting for a lock
(look for the W (wait) flag). The holders are listed in order: granted
holders first (if any), then waiting holders (if any). So the only
interesting holder in this case will be the one with the W flag set that
is nearest to its associated glock.

Looking back at the associated G: line, there are various lock modes
listed. The s: field shows the current state of the glock and the t:
field shows the target state. The target state is only of interest if
the l (locked) flag is set on the glock itself (again, f: is the flags
field). In that case it tells you that a remote lock request is in
progress (i.e. a request has been sent to the DLM) to convert from the
current lock mode (s:) to the target lock mode (t:).

Demote requests are issued by the DLM when it receives a lock request
which conflicts with an existing holder. In that case, the D flag is set
on the glock and the d: field shows the state which has been requested,
along with the time (in jiffies) since the demote request was received.

I know all that sounds quite complicated, but in fact it's usually
pretty easy to find the cause of deadlocks. It is usually just a matter
of first tracking down the holders (H:) which are first in the queue
(i.e. immediately after a G: line) with the W flag set, then looking at
the lock with the same number (the n: field of the G: line) across the
cluster to see which node is still holding that lock (i.e. its s: is not
UN), and then checking the remaining flags to see why that is the case.
There is a tool which does some of this automatically, although I've not
tried it myself as I still tend to use the manual method.
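As a rough sketch of that manual search (the layout of the H: lines is
assumed here rather than guaranteed, and "myclust:test" is just a
made-up stand-in for <fsname>, so substitute your own), something like
this on each node:

    # make sure debugfs is mounted in the usual place
    mount -t debugfs none /sys/kernel/debug

    # print each G: line followed by any of its H: lines whose flags
    # field (assumed to be the third field) contains W, i.e. waiting
    # holders; a glock is printed once per waiter
    awk '/^G:/ { g = $0 }
         /^ H:/ && $3 ~ /W/ { print g; print }' \
        /sys/kernel/debug/gfs2/myclust:test/glocks

You can then take the n: number from those G: lines and grep for it in
the glocks file on the other nodes to see which node is still holding
the lock (s: not UN) and which flags it has set.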
If you get stuck, then please file a bug (just file it against
Fedora/rawhide and mark it as Debian in the comments somewhere, so we
know which kernel it is), attach the glock dumps to it, and then we can
take a look at it.

I have it on my TODO list to write this up properly at some stage and
turn it into a GFS2 debugging FAQ or something like that. At the moment
the only documentation on glocks is the
linux-2.6/Documentation/filesystems/gfs2-glocks.txt file, although
that's aimed more at developers than users, I'm afraid,

Steve.