On Tue, 2004-11-23 at 03:14, Patrick Caulfield wrote:
> On Tue, Nov 23, 2004 at 11:50:23AM +0800, David Teigland wrote:
> >
> > On Mon, Nov 22, 2004 at 12:44:07PM -0800, Daniel McNeil wrote:
> >
> > > The full stack traces are available here:
> > > http://developer.osdl.org/daniel/gfs_umount_hang/
> >
> > Thanks, it's evident that the dlm became "stuck" on the node that's not
> > doing the umount.  All the hung processes are blocked on the dlm's
> > "in_recovery" lock.
>
> There also seems to be a GFS process with a failed "down_write" in dlm_unlock,
> which might be a clue.  It's not the in_recovery lock, because that's only held
> for read during normal locking operations, so it must be either the res_lock or
> the ls_unlock_sem.  Odd, as those are normally only held for very short time
> periods.

More info.

I rebooted cl031, the node that was not doing the umount but was hung
doing the cat of /proc/cluster/services.  The 1st node saw the node go
away, but the umount was still hung.  I was expecting the recovery from
the death of this node to clean up any locking problem.

I rebooted the 2nd node and started the tests over again last night.
This morning one node (cl030) got this:

cur_state = 2, new_state = 2
Kernel panic - not syncing: GFS: Assertion failed on line 69 of file
/Views/redhat-cluster/cluster/gfs-kernel/src/gfs/bits.c
GFS: assertion: "valid_change[new_state * 4 + cur_state]"
GFS: time = 1101174691
GFS: fsid=gfs_cluster:stripefs.0: RG = 65530

I'll upgrade to latest cvs and start the tests over.

Is there anything I can do to get more info when this kind of thing
happens?

Thanks,

Daniel