I left some automated tests running over the weekend and ran into a umount hang. A single GFS file system was mounted on 2 nodes of a 3-node cluster. The test had just removed 2 subdirectories - one from each node. The test was then unmounting the file system from one node when the umount hung.

Here's a stack trace from the hung umount (on cl030):

(node cl030)
umount        D 00000008     0 14345  14339                     (NOTLB)
db259e04 00000086 db259df4 00000008 00000001 00000000 00000008 db259dc8
       eda96dc0 f15d0750 c044aac0 db259000 db259de4 c01196d1 f7cf0b90 450fa673
       c170df60 00000000 00049d65 44bb3183 0002dfe0 f15d0750 f15d08b0 c170df60
Call Trace:
 [<c03d39d4>] wait_for_completion+0xa4/0xe0
 [<f8aba97e>] kcl_leave_service+0xfe/0x180 [cman]
 [<f8b06756>] release_lockspace+0x2d6/0x2f0 [dlm]
 [<f8a9010c>] release_gdlm+0x1c/0x30 [lock_dlm]
 [<f8a903f4>] lm_dlm_unmount+0x24/0x50 [lock_dlm]
 [<f881e496>] lm_unmount+0x46/0xac [lock_harness]
 [<f8b8089f>] gfs_put_super+0x30f/0x3c0 [gfs]
 [<c01654fa>] generic_shutdown_super+0x18a/0x1a0
 [<c016608d>] kill_block_super+0x1d/0x40
 [<c01652a1>] deactivate_super+0x81/0xa0
 [<c017c6cc>] sys_umount+0x3c/0xa0
 [<c017c749>] sys_oldumount+0x19/0x20
 [<c010537d>] sysenter_past_esp+0x52/0x71

[root@cl030 proc]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[3 1 2]

DLM Lock Space:  "stripefs"                        222 275 run       S-13,210,1
[1 3]

Cat'ing /proc/cluster/services on the 2nd node (cl031) hangs:

[root@cl031 root]# cat /proc/cluster/services

From the 2nd node (cl031), here are some stack traces that might be interesting:

cman_serviced D 00000008     0  3818      6         12593   665 (L-TLB)
ebc23edc 00000046 ebc23ecc 00000008 00000001 00000010 00000008 00000002
       f7726dc0 00000000 00000000 f5a4b230 00000000 00000010 00000010 ebc23f24
       c170df60 00000000 000005a8 d42bcdab 0002e201 eb5119f0 eb511b50 ebc23f08
Call Trace:
 [<c03d409c>] rwsem_down_write_failed+0x9c/0x18e
 [<f8b06acb>] .text.lock.lockspace+0x4e/0x63 [dlm]
 [<f8a8daa2>] process_leave_stop+0x32/0x80 [cman]
 [<f8a8dcf2>] process_one_uevent+0xc2/0x100 [cman]
 [<f8a8e798>] process_membership+0xc8/0xca [cman]
 [<f8a8bf65>] serviced+0x165/0x1d0 [cman]
 [<c013426a>] kthread+0xba/0xc0
 [<c0103325>] kernel_thread_helper+0x5/0x10

cat /proc/cluster/services stack trace:

cat           D 00000008     0 22151      1               13435 (NOTLB)
c1f7ae90 00000086 c1f7ae7c 00000008 00000002 000000d0 00000008 c1f7ae74
       eb0acdc0 00000001 00000246 00000000 e20c4670 f474f1d0 00000000 c17168c0
       c1715f60 00000001 00159c05 bad07454 0003aa83 e20c4670 e20c47d0 00000000
Call Trace:
 [<c03d2b03>] __down+0x93/0xf0
 [<c03d2c93>] __down_failed+0xb/0x14
 [<f8a9053c>] .text.lock.sm_misc+0x2d/0x41 [cman]
 [<f8a90144>] sm_seq_next+0x34/0x50 [cman]
 [<c017e629>] seq_read+0x159/0x2b0
 [<c015e49f>] vfs_read+0xaf/0x120
 [<c015e74b>] sys_read+0x4b/0x80
 [<c010537d>] sysenter_past_esp+0x52/0x71

The full stack traces are available here:
http://developer.osdl.org/daniel/gfs_umount_hang/

I'm running on 2.6.9 and CVS code from Nov 9th.

Any ideas?

Daniel
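
P.S. In case it helps to reproduce, this is roughly the sequence the test goes through. It's only a sketch - the device, mount point, and directory names below are placeholders, not the paths the test actually uses:

    # GFS file system mounted on 2 of the 3 cluster nodes
    # on cl030:
    mount -t gfs /dev/pool/stripefs /mnt/stripefs    # placeholder device/mount point
    # on cl031:
    mount -t gfs /dev/pool/stripefs /mnt/stripefs

    # test removes one subdirectory from each node
    # on cl030:
    rm -rf /mnt/stripefs/dir1                        # placeholder directory names
    # on cl031:
    rm -rf /mnt/stripefs/dir2

    # test then unmounts from one node - this umount is what hangs
    # on cl030:
    umount /mnt/stripefs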