[Linux-cluster] mount hang during test runs

Daniel McNeil <daniel@xxxxxxxx> · Mon, 10 Jan 2005 16:50:20 -0800

I started another test run last week and let it run
over the weekend.  A 3-node test was running when it hung.

I set /proc/cluster/config/cman/max_retries to 9
and /proc/cluster/config/cman/hello_timer to 1.
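
For anyone trying to reproduce the run, here is a minimal sketch of applying
those two settings from a script.  It assumes the cman entries accept a plain
decimal value written as text (the same as echoing the number into the file);
the post does not say how the values were written.  The helper names
set_cman_tunable and apply_test_settings are made up for illustration.

```python
import os

# Directory named in the post; the helpers below are hypothetical.
CMAN_CONFIG_DIR = "/proc/cluster/config/cman"


def set_cman_tunable(name, value, base=CMAN_CONFIG_DIR):
    """Write one tunable as decimal text and return the path written.

    Assumption: the proc entry takes a plain number followed by a newline.
    """
    path = os.path.join(base, name)
    with open(path, "w") as f:
        f.write(f"{value}\n")
    return path


def apply_test_settings(base=CMAN_CONFIG_DIR):
    """Apply the two values used for this test run."""
    for name, value in (("max_retries", 9), ("hello_timer", 1)):
        set_cman_tunable(name, value, base)
```

Calling apply_test_settings() as root on each node would apply both values;
passing a different base directory lets the helper be tried without touching
/proc.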

This time I hit a mount hang.  The mount is hung on cl032:

mount         D C170F414     0 18375  18369                     (NOTLB)
e2dbbc20 00000082 e1dbda10 c170f414 0003e36e 00000000 00000008 c011bb10
       d5ea8d58 57435700 0003e36e c18880ac e2dbbc00 e1dbda10 00000000 c170f8c0
       c170ef60 00000000 000038d3 57435987 0003e36e e1dbcf50 e1dbd0b8 00000000
Call Trace:
 [<c03dbac4>] wait_for_completion+0xa4/0xe0
 [<f8a92ed2>] kcl_join_service+0x162/0x1a0 [cman]
 [<f8966fbf>] init_mountgroup+0x6f/0xc0 [lock_dlm]
 [<f8969411>] lm_dlm_mount+0xa1/0xf0 [lock_dlm]
 [<f8812355>] lm_mount+0x155/0x250 [lock_harness]
 [<f8affa0d>] gfs_lm_mount+0x1fd/0x390 [gfs]
 [<f8b0ee53>] fill_super+0x513/0x1330 [gfs]
 [<f8b0fe49>] gfs_get_sb+0x199/0x210 [gfs]
 [<c0168e4c>] do_kern_mount+0x5c/0x110
 [<c0180138>] do_new_mount+0x98/0xe0
 [<c0180905>] do_mount+0x165/0x1b0
 [<c0180dd5>] sys_mount+0xb5/0x140
 [<c010537d>] sysenter_past_esp+0x52/0x71

Looks like a problem joining the mount group: the mount is
waiting in kcl_join_service() and never gets its completion.

/proc/cluster/services on each of the three nodes shows:

[root@cl030 cman]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "stripefs"                        324 693 run       -
[1 2 3]

GFS Mount Group: "stripefs"                        325 694 update    U-4,1,3
[1 2 3]

[root@cl031 cluster]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "stripefs"                        324 457 run       -
[1 2 3]

GFS Mount Group: "stripefs"                        325 458 update    U-4,1,3
[1 2 3]

[root@cl032 cluster]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "stripefs"                        324 225 run       -
[1 2 3]

GFS Mount Group: "stripefs"                        325 226 join      S-6,20,3
[1 2 3]
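
When comparing these dumps across nodes, it can help to pull out just the
services that are not in the "run" state.  The following is a rough sketch
only: the line format is inferred from the output above, not from any cman
documentation, and parse_services and not_running are hypothetical helper
names.

```python
import re

# One service line: kind, quoted name, GID, LID, state, code.
# The pattern is inferred from the sample output, not from a format spec.
SERVICE_RE = re.compile(
    r'^(?P<kind>[^:]+):\s+"(?P<name>[^"]*)"\s+'
    r'(?P<gid>\d+)\s+(?P<lid>\d+)\s+(?P<state>\S+)\s+(?P<code>\S+)\s*$'
)
# Member list line such as "[1 2 3]".
MEMBERS_RE = re.compile(r'^\[(?P<nodes>[\d\s]*)\]\s*$')


def parse_services(text):
    """Parse the text of /proc/cluster/services into a list of dicts."""
    services = []
    for raw in text.splitlines():
        line = raw.strip()
        m = SERVICE_RE.match(line)
        if m:
            entry = m.groupdict()
            entry["gid"] = int(entry["gid"])
            entry["lid"] = int(entry["lid"])
            entry["members"] = []
            services.append(entry)
            continue
        m = MEMBERS_RE.match(line)
        if m and services:
            # Attach the member list to the service line just before it.
            services[-1]["members"] = [int(n) for n in m.group("nodes").split()]
    return services


def not_running(services):
    """Return the services whose state is anything other than 'run'."""
    return [s for s in services if s["state"] != "run"]


# Output from cl032, copied from above.
SAMPLE = '''Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "stripefs"                        324 225 run       -
[1 2 3]

GFS Mount Group: "stripefs"                        325 226 join      S-6,20,3
[1 2 3]
'''

for svc in not_running(parse_services(SAMPLE)):
    print(svc["kind"], svc["name"], svc["state"], svc["code"], svc["members"])
```

Run against the three outputs above, the only service this flags is the GFS
mount group: "update" on cl030 and cl031, "join" on cl032.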

I collected stack traces and a bunch of other info.  It is
available here:
http://developer.osdl.org/daniel/GFS/mount.hang.05jan2005/

Any ideas on debugging this one?

Daniel