I have a two-node cluster (two HP DL360 G7 servers) with a shared GFS2 file system located on an HP Modular Smart Array. Node1 is the 'active' server and performs almost all GFS2 access; node2 is a 'passive' backup and rarely accesses the shared file system. Both nodes are currently running kernel-PAE-2.6.18-274.17.1.el5.i686. I am aware of the kernel updates available in the Red Hat 5.8 release and have reviewed the change logs and the associated bug reports I have access to, to determine whether the handful of gfs2 changes might apply to this situation. They do not seem to, but we plan to upgrade our production servers when we can to rule out that possibility.

Intermittently (3-4 times a month) the gfs2 file system appears to lock up, and any process attempting to access it enters D state. Networking continues to function and openais is happy, so no fencing occurs. Power cycling the passive node breaks the deadlock and processing on the active node continues.

During the last hang we ran the gfs2_hangalyzer tool, suggested in some older threads on this deadlock subject, to capture the dlm and glock info. I can't find explanations of what some of the fields mean, so I'm hoping someone can help me interpret the results, confirm whether my understanding of the output is correct, or suggest how to debug further when it happens again. So far we can't come up with a reproduction scenario. I have attached the gfs2_hangalyzer summary output as hangalyzer.txt; I have the raw lock data as well if required.

The tool reports that there are two glocks on which processes are waiting, but no other process holds them. So it looks like a deadlock, since if no process owned them they should have been released. The tool also reports that the two glocks were granted to two process IDs. This is an excerpt from the hangalyzer output:

--------------------------------------------
There are 2 glocks with waiters.
node1, pid 5380 is waiting for glock 2/85187, but no holder was found.
The dlm has granted lkb " 2 85187" to pid 5021

lkb_id N RemoteID pid exflg lkbflgs stat gr rq waiting n ln resource name
node1 : FS1: 3e00003 2 10c0002 5021 0 10000 grnt 5 -1 0 0 24 " 2 85187"
node1 : FS1: 1501c6a 0 0 5380 0 0 wait -1 3 0 0 24 " 2 85187"
node2 : FS1: G: s:EX n:2/85187 f:dyq t:EX d:SH/0 l:0 a:0 r:4 m:150
node2 : (pending demote, dirty, holder queued)
node2 : FS1: I: n:1711/545159 t:8 f:0x10 d:0x00000000 s:957/957
lkb_id N RemoteID pid exflg lkbflgs stat gr rq waiting n ln resource name
node2 : FS1: 10c0002 1 3e00003 5021 0 0 grnt 5 -1 0 1 24 " 2 85187"
--------------------------------------------

As I understand this, on node1 the resource " 2 85187" is granted (grnt) to process 5021 on node2 while process 5380 waits on it. At the same time, node2 sees the same resource granted (grnt) to process 5021 on node1. On node1, process ID 5021 is [glock_workqueue]. From 'ps axl':

1 0 5021 67 10 -5 0 0 worker S< ? 0:07 [glock_workqueue]
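For reference, the gr/rq columns appear to be DLM lock modes. This mapping is my own reading of the kernel's dlm.h, so please correct me if it's wrong:

    -1 = IV (none/invalid), 0 = NL, 1 = CR, 2 = CW, 3 = PR (shared), 4 = PW, 5 = EX

If that's right, the granted lkb above holds EX (gr 5) while the node1 waiter (pid 5380) is requesting PR (rq 3), which seems consistent with node2's glock line: held s:EX with a pending demote to SH.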
A similar thing occurs for resource " 2 81523":

--------------------------------------------
lkb_id N RemoteID pid exflg lkbflgs stat gr rq waiting n ln resource name
node1 : FS1: 2f20002 2 2970001 5021 44 10000 grnt 3 -1 0 0 24 " 2 81523"
node1 : FS1: 3961d2b 0 0 5022 0 0 wait -1 5 0 0 24 " 2 81523"
node2 : FS1: G: s:SH n:2/81523 f:dq t:SH d:UN/0 l:0 a:0 r:4 m:100
node2 : (pending demote, holder queued)
node2 : FS1: I: n:126/529699 t:4 f:0x10 d:0x00000001 s:3864/3864
lkb_id N RemoteID pid exflg lkbflgs stat gr rq waiting n ln resource name
node2 : FS1: 2970001 1 2f20002 5029 44 0 grnt 3 -1 0 1 24 " 2 81523"
--------------------------------------------

On node1 the resource " 2 81523" is granted to process 5021 on node2 while local process 5022 waits on it. On node2, however, the lock appears to be granted to process 5029 from node1. On node1, process ID 5029 is [delete_workqueu]. From 'ps axl':

1 0 5029 67 10 -5 0 0 worker S< ? 0:00 [delete_workqueu]

Is my understanding of this output correct? Is there more info I should try to gather to diagnose the issue when it happens again?
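In case it matters, this is roughly how we intend to capture the raw data on both nodes during the next hang, before power cycling node2. The debugfs file names below are from memory and may not match this 2.6.18 kernel exactly, so treat the exact paths as my assumptions:

    mount -t debugfs none /sys/kernel/debug                    # if not already mounted
    cat /sys/kernel/debug/gfs2/<cluster>:FS1/glocks > /tmp/glocks.`hostname`
    cat /sys/kernel/debug/dlm/FS1_locks > /tmp/dlm_locks.`hostname`
    echo t > /proc/sysrq-trigger                               # dump all task stacks to the kernel log

The sysrq-t dump should at least show where the D-state processes (and the glock_workqueue/delete_workqueu threads) are blocked. If there is anything else worth grabbing while the hang is in progress, please let me know.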
[hangalyzer.txt]

node1 : FS1: G: s:UN n:2/85187 f:lq t:SH d:EX/0 l:0 a:0 r:4 m:200
node1 : (locked, holder queued)
node1 : FS1: H: s:SH f:aW e:0 p:5380 [chown] gfs2_lookup+0x42/0x8e [gfs2]
lkb_id N RemoteID pid exflg lkbflgs stat gr rq waiting n ln resource name
node1 : FS1: 3e00003 2 10c0002 5021 0 10000 grnt 5 -1 0 0 24 " 2 85187"
node1 : FS1: 1501c6a 0 0 5380 0 0 wait -1 3 0 0 24 " 2 85187"
node2 : FS1: G: s:EX n:2/85187 f:dyq t:EX d:SH/0 l:0 a:0 r:4 m:150
node2 : (pending demote, dirty, holder queued)
node2 : FS1: I: n:1711/545159 t:8 f:0x10 d:0x00000000 s:957/957
lkb_id N RemoteID pid exflg lkbflgs stat gr rq waiting n ln resource name
node2 : FS1: 10c0002 1 3e00003 5021 0 0 grnt 5 -1 0 1 24 " 2 85187"

node1 : FS1: G: s:UN n:2/81523 f:lqO t:EX d:EX/0 l:0 a:0 r:40 m:10
node1 : (locked, holder queued, callback owed)
node1 : FS1: H: s:EX f:W e:0 p:16386 [generic_templat] gfs2_createi+0x58/0xe90 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:8036 [host_status] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:16614 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:16809 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:16834 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:17536 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:17541 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:18165 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:18235 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:18240 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:18731 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:18733 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:18741 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:18760 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:18773 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:18893 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:18920 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:19583 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:19630 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:19692 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:19695 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:19714 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:19718 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:20070 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:20072 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:20421 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:20428 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:20435 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:8071 [alarm_manager] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:30239 [generic_templat] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:30373 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:30836 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:30860 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:30964 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:8624 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:20485 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
node1 : FS1: H: s:SH f:aW e:0 p:22548 [gotoNuPointAWC.] gfs2_permission+0x69/0xb6 [gfs2]
lkb_id N RemoteID pid exflg lkbflgs stat gr rq waiting n ln resource name
node1 : FS1: 2f20002 2 2970001 5021 44 10000 grnt 3 -1 0 0 24 " 2 81523"
node1 : FS1: 3961d2b 0 0 5022 0 0 wait -1 5 0 0 24 " 2 81523"
node2 : FS1: G: s:SH n:2/81523 f:dq t:SH d:UN/0 l:0 a:0 r:4 m:100
node2 : (pending demote, holder queued)
node2 : FS1: I: n:126/529699 t:4 f:0x10 d:0x00000001 s:3864/3864
lkb_id N RemoteID pid exflg lkbflgs stat gr rq waiting n ln resource name
node2 : FS1: 2970001 1 2f20002 5029 44 0 grnt 3 -1 0 1 24 " 2 81523"

There are 2 glocks with waiters.
node1, pid 5380 is waiting for glock 2/85187, but no holder was found.
The dlm has granted lkb " 2 85187" to pid 5021
node1, pid 16386 is waiting for glock 2/81523, but no holder was found.
node1, pid 8036 is waiting for glock 2/81523, but no holder was found.
node1, pid 16614 is waiting for glock 2/81523, but no holder was found.
node1, pid 16809 is waiting for glock 2/81523, but no holder was found.
node1, pid 16834 is waiting for glock 2/81523, but no holder was found.
node1, pid 17536 is waiting for glock 2/81523, but no holder was found.
node1, pid 17541 is waiting for glock 2/81523, but no holder was found.
node1, pid 18165 is waiting for glock 2/81523, but no holder was found.
node1, pid 18235 is waiting for glock 2/81523, but no holder was found.
node1, pid 18240 is waiting for glock 2/81523, but no holder was found.
node1, pid 18731 is waiting for glock 2/81523, but no holder was found.
node1, pid 18733 is waiting for glock 2/81523, but no holder was found.
node1, pid 18741 is waiting for glock 2/81523, but no holder was found.
node1, pid 18760 is waiting for glock 2/81523, but no holder was found.
node1, pid 18773 is waiting for glock 2/81523, but no holder was found.
node1, pid 18893 is waiting for glock 2/81523, but no holder was found.
node1, pid 18920 is waiting for glock 2/81523, but no holder was found.
node1, pid 19583 is waiting for glock 2/81523, but no holder was found.
node1, pid 19630 is waiting for glock 2/81523, but no holder was found.
node1, pid 19692 is waiting for glock 2/81523, but no holder was found.
node1, pid 19695 is waiting for glock 2/81523, but no holder was found.
node1, pid 19714 is waiting for glock 2/81523, but no holder was found.
node1, pid 19718 is waiting for glock 2/81523, but no holder was found.
node1, pid 20070 is waiting for glock 2/81523, but no holder was found.
node1, pid 20072 is waiting for glock 2/81523, but no holder was found.
node1, pid 20421 is waiting for glock 2/81523, but no holder was found.
node1, pid 20428 is waiting for glock 2/81523, but no holder was found.
node1, pid 20435 is waiting for glock 2/81523, but no holder was found.
node1, pid 8071 is waiting for glock 2/81523, but no holder was found.
node1, pid 30239 is waiting for glock 2/81523, but no holder was found.
node1, pid 30373 is waiting for glock 2/81523, but no holder was found.
node1, pid 30836 is waiting for glock 2/81523, but no holder was found.
node1, pid 30860 is waiting for glock 2/81523, but no holder was found.
node1, pid 30964 is waiting for glock 2/81523, but no holder was found.
node1, pid 8624 is waiting for glock 2/81523, but no holder was found.
node1, pid 20485 is waiting for glock 2/81523, but no holder was found.
node1, pid 22548 is waiting for glock 2/81523, but no holder was found.
The dlm has granted lkb " 2 81523" to pid 5029