I started another test run last week and let it run over the weekend.
A 3-node test was running when it hung. I had set
/proc/cluster/config/cman/max_retries to 9 and
/proc/cluster/config/cman/hello_timer to 1.

This time I hit a mount hang.

The mount is hung on cl032:

mount         D C170F414     0 18375  18369                     (NOTLB)
e2dbbc20 00000082 e1dbda10 c170f414 0003e36e 00000000 00000008 c011bb10
       d5ea8d58 57435700 0003e36e c18880ac e2dbbc00 e1dbda10 00000000 c170f8c0
       c170ef60 00000000 000038d3 57435987 0003e36e e1dbcf50 e1dbd0b8 00000000
Call Trace:
 [<c03dbac4>] wait_for_completion+0xa4/0xe0
 [<f8a92ed2>] kcl_join_service+0x162/0x1a0 [cman]
 [<f8966fbf>] init_mountgroup+0x6f/0xc0 [lock_dlm]
 [<f8969411>] lm_dlm_mount+0xa1/0xf0 [lock_dlm]
 [<f8812355>] lm_mount+0x155/0x250 [lock_harness]
 [<f8affa0d>] gfs_lm_mount+0x1fd/0x390 [gfs]
 [<f8b0ee53>] fill_super+0x513/0x1330 [gfs]
 [<f8b0fe49>] gfs_get_sb+0x199/0x210 [gfs]
 [<c0168e4c>] do_kern_mount+0x5c/0x110
 [<c0180138>] do_new_mount+0x98/0xe0
 [<c0180905>] do_mount+0x165/0x1b0
 [<c0180dd5>] sys_mount+0xb5/0x140
 [<c010537d>] sysenter_past_esp+0x52/0x71

Looks like a problem joining the mount group.

/proc/cluster/services shows:

[root@cl030 cman]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "stripefs"                        324 693 run       -
[1 2 3]

GFS Mount Group: "stripefs"                        325 694 update    U-4,1,3
[1 2 3]

[root@cl031 cluster]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "stripefs"                        324 457 run       -
[1 2 3]

GFS Mount Group: "stripefs"                        325 458 update    U-4,1,3
[1 2 3]

[root@cl032 cluster]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "stripefs"                        324 225 run       -
[1 2 3]

GFS Mount Group: "stripefs"                        325 226 join      S-6,20,3
[1 2 3]

I collected stack traces and a bunch of other info.
It is available here:

http://developer.osdl.org/daniel/GFS/mount.hang.05jan2005/

Any ideas on debugging this one?

Daniel