On Mon, Jan 10, 2005 at 04:50:20PM -0800, Daniel McNeil wrote:
> I collected stack traces and a bunch of other info. It is
> available here:
> http://developer.osdl.org/daniel/GFS/mount.hang.05jan2005/
>
> Any ideas on debugging this one?

- Processes on cl032 and cl030 are blocked waiting for dlm responses from
  cl031.

- Processes on cl031 are blocked waiting for dlm responses to resource
  directory lookups (looking up unknown resource masters for 10,0 and
  3,11).

- It looks like dlm_recvd may be stuck on cl031 preventing it from
  receiving the requests from the other two nodes and preventing it from
  receiving the responses to its own lookup requests. This is probably
  the crux of the problem. Unfortunately, all we see for dlm_recvd on
  cl031 (from stack.cl031) is:

  dlm_recvd R running 0 29053 6 29054 29052 (L-TLB)


cl032 - requesting PR on 10,1 (mounting)
----------------------------------------

lock_dlm2 D C170F414 0 18399 4 18398 (L-TLB)
e6a1fe04 00000046 e7639930 c170f414 0003e36e 00000018 00000008 00000000
d5ea8d58 7505db9d 0003e36e db8ff348 e6a1fdf8 e7639930 00000000 c170f8c0
c170ef60 00000000 000138a5 7505df29 0003e36e f4377170 f43772d8 00000000
Call Trace:
 [<c03dbac4>] wait_for_completion+0xa4/0xe0
 [<f8968139>] lm_dlm_lock_sync+0x59/0x70 [lock_dlm]
 [<f8966163>] id_test_and_set+0xa3/0x260 [lock_dlm]
 [<f8966597>] claim_jid+0x47/0x120 [lock_dlm]
 [<f8966c3d>] process_start+0x46d/0x610 [lock_dlm]
 [<f896ca54>] dlm_async+0x274/0x3c0 [lock_dlm]
 [<c0134cca>] kthread+0xba/0xc0
 [<c0103325>] kernel_thread_helper+0x5/0x10


cl031 - requesting PR on 10,0
-----------------------------

lock_dlm1 D C170EF9C 0 29065 6 29066 29054 (L-TLB)
d2e0ede8 00000046 f76d3850 c170ef9c 0003e354 00000018 00000008 00000000
f6750838 30672ddf 0003e354 dbf900dc d2e0eddc f76d3850 00000000 c170f8c0
c170ef60 00000000 0002088a 306734a4 0003e354 f64d8710 f64d8878 00000000
Call Trace:
 [<c03dbac4>] wait_for_completion+0xa4/0xe0
 [<f8968139>] lm_dlm_lock_sync+0x59/0x70 [lock_dlm]
 [<f8966443>] id_value+0x93/0x130 [lock_dlm]
 [<f896650f>] id_find+0x2f/0x70 [lock_dlm]
 [<f896670a>] discover_jids+0x6a/0xa0 [lock_dlm]
 [<f8966ab8>] process_start+0x2e8/0x610 [lock_dlm]
 [<f896ca54>] dlm_async+0x274/0x3c0 [lock_dlm]
 [<c0134cca>] kthread+0xba/0xc0
 [<c0103325>] kernel_thread_helper+0x5/0x10


cl031 - requesting NL on 3,11
-----------------------------

df D 00000008 0 29088 29086 (NOTLB)
dd0e5c14 00000082 dd0e5c04 00000008 00000001 f8b3b571 00000008 dd0e5c0c
ecb0a568 dbf9002c d6e5415c 00000008 dd0e5c44 00000018 00000000 00000000
c170ef60 00000000 00000fec 4d5f5234 0003e3a1 f6789190 f67892f8 dd0e5c44
Call Trace:
 [<c03dbac4>] wait_for_completion+0xa4/0xe0
 [<f896804b>] do_dlm_lock_sync+0x4b/0x60 [lock_dlm]
 [<f89683d4>] hold_null_lock+0xb4/0xd0 [lock_dlm]
 [<f8968470>] lm_dlm_hold_lvb+0x40/0x50 [lock_dlm]
 [<f8afff2c>] gfs_lm_hold_lvb+0x3c/0x50 [gfs]
 [<f8af49a1>] gfs_lvb_hold+0x41/0xe0 [gfs]
 [<f8b19c13>] gfs_ri_update+0x1d3/0x250 [gfs]
 [<f8b19d78>] gfs_rindex_hold+0xe8/0x100 [gfs]
 [<f8b1d781>] gfs_stat_gfs+0x21/0x80 [gfs]
 [<f8b131e0>] gfs_statfs+0x30/0xd0 [gfs]
 [<c015e8ac>] vfs_statfs+0x4c/0x70
 [<c015e9cb>] vfs_statfs64+0x1b/0x50
 [<c015eb07>] sys_statfs64+0x67/0xa0
 [<c010537d>] sysenter_past_esp+0x52/0x71


cl030 - requesting PR on 10,1
-----------------------------

lock_dlm2 D 00000008 0 14338 6 14337 (L-TLB)
cf1b4de8 00000046 cf1b4dd8 00000008 00000001 00000018 00000008 00000000
f600ec98 00000000 00000000 cbe5ed24 cf1b4ddc 00000000 f7b82054 cf1b4df8
c170ef60 00000000 00014966 b62fc6b6 00009f97 f6610730 f6610898 00000009
Call Trace:
 [<c03dbac4>] wait_for_completion+0xa4/0xe0
 [<f8b57139>] lm_dlm_lock_sync+0x59/0x70 [lock_dlm]
 [<f8b55443>] id_value+0x93/0x130 [lock_dlm]
 [<f8b5550f>] id_find+0x2f/0x70 [lock_dlm]
 [<f8b5570a>] discover_jids+0x6a/0xa0 [lock_dlm]
 [<f8b55ab8>] process_start+0x2e8/0x610 [lock_dlm]
 [<f8b5ba54>] dlm_async+0x274/0x3c0 [lock_dlm]
 [<c0134cca>] kthread+0xba/0xc0
 [<c0103325>] kernel_thread_helper+0x5/0x10


cl030 - requesting NL on 3,11
-----------------------------

df D 00000008 0 14362 14360 (NOTLB)
d10a3c14 00000086 d10a3c04 00000008 00000001 f8b3b571 00000008 d10a3c0c
f6b89818 cbe5ec74 c2015b28 00000008 d10a3c44 00000018 00000000 00000000
c170ef60 00000000 000305ef f0cf7f52 00009fe4 da6f0f10 da6f1078 d10a3c44
Call Trace:
 [<c03dbac4>] wait_for_completion+0xa4/0xe0
 [<f8b5704b>] do_dlm_lock_sync+0x4b/0x60 [lock_dlm]
 [<f8b573d4>] hold_null_lock+0xb4/0xd0 [lock_dlm]
 [<f8b57470>] lm_dlm_hold_lvb+0x40/0x50 [lock_dlm]
 [<f8afff2c>] gfs_lm_hold_lvb+0x3c/0x50 [gfs]
 [<f8af49a1>] gfs_lvb_hold+0x41/0xe0 [gfs]
 [<f8b19c13>] gfs_ri_update+0x1d3/0x250 [gfs]
 [<f8b19d78>] gfs_rindex_hold+0xe8/0x100 [gfs]
 [<f8b1d781>] gfs_stat_gfs+0x21/0x80 [gfs]
 [<f8b131e0>] gfs_statfs+0x30/0xd0 [gfs]
 [<c015e8ac>] vfs_statfs+0x4c/0x70
 [<c015e9cb>] vfs_statfs64+0x1b/0x50
 [<c015eb07>] sys_statfs64+0x67/0xa0
 [<c010537d>] sysenter_past_esp+0x52/0x71


cl032 (nodeid 3, mounting and looking for free jid)
---------------------------------------------------

Resource dfdbf26c (parent 00000000). Name (len=24) " 10 1"
Local Copy, Master is node 2
Granted Queue
Conversion Queue
Waiting Queue
000102aa -- (PR) Master: 00000000 LQ: 3,0x9 (pid 18399)


cl031 (nodeid 2, jid 1)
-----------------------

Resource cc0100a4 (parent 00000000). Name (len=24) " 10 1"
Master Copy
LVB: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00
     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Granted Queue
000100d5 PR (pid 29066)
Conversion Queue
Waiting Queue

Resource e16fe26c (parent 00000000). Name (len=24) " 10 0"
Local Copy, Master is node -1
Granted Queue
Conversion Queue
Waiting Queue

Resource e4b5573c (parent 00000000). Name (len=24) " 3 11"
Local Copy, Master is node -1
Granted Queue
Conversion Queue
Waiting Queue


cl030 (nodeid 1, jid 0)
-----------------------

Resource cfb9054c (parent 00000000). Name (len=24) " 10 0"
Master Copy
LVB: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00
     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Granted Queue
000102c3 PR (pid 14338)
Conversion Queue
Waiting Queue

Resource d798911c (parent 00000000). Name (len=24) " 10 1"
Local Copy, Master is node 2
Granted Queue
Conversion Queue
Waiting Queue
000103b7 -- (PR) Master: 00000000 LQ: 3,0x9 (pid 14338)

Resource d38d7b2c (parent 00000000). Name (len=24) " 3 11"
Local Copy, Master is node 2
Granted Queue
Conversion Queue
Waiting Queue
0002022e -- (NL) Master: 00000000 LQ: 3,0x8 (pid 14362)

-- 
Dave Teigland <teigland@xxxxxxxxxx>
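
For anyone who wants to repeat this kind of reading on other dumps, here is a rough Python sketch of how the "who is waiting on which master" summary could be pulled out of a lock dump mechanically. It is only an illustration: the function name waiting_locks, the regular expressions, and the DUMP excerpt (the cl030 resources quoted above, with the LVB lines left out) are made up for this example and are not part of any dlm tool. It also assumes that a Waiting Queue entry of the form "-- (PR)" is a request for PR that has not been granted yet.

```python
import re

# Excerpt of the cl030 lock dump quoted above (LVB lines omitted).
DUMP = """\
Resource cfb9054c (parent 00000000). Name (len=24) " 10 0"
Master Copy
Granted Queue
000102c3 PR (pid 14338)
Conversion Queue
Waiting Queue

Resource d798911c (parent 00000000). Name (len=24) " 10 1"
Local Copy, Master is node 2
Granted Queue
Conversion Queue
Waiting Queue
000103b7 -- (PR) Master: 00000000 LQ: 3,0x9 (pid 14338)

Resource d38d7b2c (parent 00000000). Name (len=24) " 3 11"
Local Copy, Master is node 2
Granted Queue
Conversion Queue
Waiting Queue
0002022e -- (NL) Master: 00000000 LQ: 3,0x8 (pid 14362)
"""

# Hypothetical patterns, written against the dump layout shown in this mail.
NAME_RE = re.compile(r'Name \(len=\d+\) "\s*(\w+)\s+(\w+)"')
MASTER_RE = re.compile(r"Master is node (-?\d+)")
WAIT_RE = re.compile(r"^[0-9a-f]{8} -- \((\w+)\).*\(pid (\d+)\)")


def waiting_locks(dump):
    """Return (resource, master, requested_mode, pid) for each waiting lock."""
    results = []
    name, master, queue = None, None, None
    for raw in dump.splitlines():
        line = raw.strip()
        m = NAME_RE.search(line)
        if m:
            # New resource; treat it as locally mastered until told otherwise.
            name = f"{m.group(1)},{m.group(2)}"
            master, queue = "local", None
            continue
        m = MASTER_RE.search(line)
        if m:
            master = int(m.group(1))  # -1 means the master is not known yet
            continue
        if line.endswith("Queue"):
            queue = line  # remember which queue the following locks are on
            continue
        m = WAIT_RE.match(line)
        if m and queue == "Waiting Queue":
            results.append((name, master, m.group(1), int(m.group(2))))
    return results


if __name__ == "__main__":
    for resource, master, mode, pid in waiting_locks(DUMP):
        print(f"pid {pid} waits for {mode} on {resource}, master node {master}")
```

Run on the cl030 excerpt, this reports the two waiting requests (PR on 10,1 and NL on 3,11), both mastered on node 2, which is cl031. It says nothing about why cl031 is not answering; that still needs the dlm_recvd stack.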