I have some locking issues (deadlocks?) with GFS.
My configuration includes 4 hosts: one of them is used as the GNBD device
exporter, and the other 3 import this GNBD partition and mount it at the /gfs
mountpoint.
LVM is also used on the imported GNBD partition, so clvmd is running.
The locking method is DLM, the GFS version is 6.1.5, and manual fencing is used.
The problem is a fairly common one: a deadlock on httpd (httpd processes in
the 'D' state). I have seen such problems reported on the list, though no
solutions.
In my case Apache is located on the GFS filesystem and I run it inside the
chroot with a command like this:
chroot /gfs/chroot /usr/local/apache/bin/httpd
The problem sometimes appears after "killall httpd": all the httpd processes
enter the 'D' state as shown by "ps ax" and stay locked in this state forever.
Moreover, the whole GFS filesystem becomes unavailable after this happens.
Even from another host, every command that tries to access the /gfs partition
hangs in the 'D' state. The last time, though, it was only partially
unavailable: the /gfs/chroot/usr hierarchy was "locked" but other parts of
GFS worked okay.
The only cure I know is to reboot the node and fence it out of the cluster.
Are there any ideas on how to fix this? I mean either the cause (the 'D' state
of the killed httpd processes) or the consequences (the GFS filesystem
becoming fully or partially unavailable after this).
I would also appreciate any help with debugging the problem.
I tried gfs_tool lockdump with the decipher_lockstate_dump tool.
bash-3.00# ps ax |grep http
14981 ? Ds 0:00 /usr/system/apache/bin/httpd
15242 ? D 0:00 /usr/system/apache/bin/httpd
24708 ? D 0:00 /usr/system/apache/bin/httpd
24709 ? D 0:00 /usr/system/apache/bin/httpd
24710 ? D 0:00 /usr/system/apache/bin/httpd
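For anyone trying to reproduce this, here is a minimal sketch of how the
'D'-state processes could be listed together with the kernel wait channel
they are sleeping in. It assumes the procps version of ps; the function name
list_d_state is only illustrative.

```shell
# Keep only lines whose second column (STAT) starts with 'D',
# i.e. processes in uninterruptible sleep.
# Expects "PID STAT WCHAN COMMAND" lines on stdin.
list_d_state() {
    awk '$2 ~ /^D/'
}

# Usage on a live node (procps ps; wchan shows where the process sleeps):
#   ps -eo pid,stat,wchan:32,args | list_d_state
```

The wchan column may help to tell whether the processes are all waiting in
the same place, but it is only a hint, not a diagnosis.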
I found only 2 locks related to these processes:
bash-3.00# ls -i /gfs/chroot/lib64/libnss_files-2.3.4.so
27190 /gfs/chroot/lib64/libnss_files-2.3.4.so
Glock (inode[2], 27190)
gl_flags = lock[1]
gl_count = 7
gl_state = shared[3]
req_gh = yes
req_bh = yes
lvb_count = 0
object = yes
new_le = no
incore_le = no
reclaim = no
aspace = 1
ail_bufs = no
Request
owner = 24710
gh_state = shared[3]
gh_flags =
error = 0
gh_iflags = promote[1] holder[6] first[7]
Holder
owner = 24710
gh_state = shared[3]
gh_flags =
error = 0
gh_iflags = promote[1] holder[6] first[7]
Waiter3
owner = 24708
gh_state = shared[3]
gh_flags =
error = 0
gh_iflags = promote[1]
Waiter3
owner = 24709
gh_state = shared[3]
gh_flags =
error = 0
gh_iflags = promote[1]
Waiter3
owner = 15242
gh_state = shared[3]
gh_flags =
error = 0
gh_iflags = promote[1]
Inode: busy
and
bash-3.00# ls -i /gfs/chroot/usr/system/apache/bin/httpd
2175961 /gfs/chroot/usr/system/apache/bin/httpd
Glock (inode[2], 2175961)
gl_flags =
gl_count = 4
gl_state = shared[3]
req_gh = no
req_bh = no
lvb_count = 0
object = yes
new_le = no
incore_le = no
reclaim = no
aspace = 1
ail_bufs = no
Holder
owner = 14981
gh_state = shared[3]
gh_flags =
error = 0
gh_iflags = promote[1] holder[6] first[7]
Inode: busy
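The lookups above (inode number from "ls -i", then the matching entry in the
dump) can be sketched as a small helper. This assumes the deciphered dump has
already been saved to a file and that every entry starts with a "Glock (type,
number)" line as in the output shown here; the function name
glocks_for_inode and the file name are hypothetical.

```shell
# Print every Glock entry (inode, iopen, ...) that refers to a given
# inode number.
#   $1 = inode number, e.g. from "ls -i"
#   $2 = file holding the deciphered lock dump (assumed layout: each
#        entry begins with a line like "Glock (inode[2], 27190)")
glocks_for_inode() {
    awk -v ino="$1" '
        /^[ \t]*Glock \(/ {
            # A new entry starts here; show it only if it names our inode.
            show = (index($0, ", " ino ")") > 0)
        }
        show
    ' "$2"
}

# Usage (file name is only an example):
#   glocks_for_inode 27190 lockdump.deciphered
```

Matching on ", NUMBER)" rather than on the bare number avoids picking up
entries whose inode number merely ends with the same digits.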
There are also these locks for the same inodes:
Glock (iopen[5], 27190)
gl_flags =
gl_count = 2
gl_state = shared[3]
req_gh = no
req_bh = no
lvb_count = 0
object = yes
new_le = no
incore_le = no
reclaim = no
aspace = no
ail_bufs = no
Holder
owner = none[-1]
gh_state = shared[3]
gh_flags = local_excl[5] exact[7]
error = 0
gh_iflags = promote[1] holder[6] first[7]
Glock (iopen[5], 2175961)
gl_flags =
gl_count = 2
gl_state = shared[3]
req_gh = no
req_bh = no
lvb_count = 0
object = yes
new_le = no
incore_le = no
reclaim = no
aspace = no
ail_bufs = no
Holder
owner = none[-1]
gh_state = shared[3]
gh_flags = local_excl[5] exact[7]
error = 0
gh_iflags = promote[1] holder[6] first[7]
During the last hang, "/gfs/chroot/usr" was unavailable, and there are two entries for this directory in the lockdump:
bash-3.00# ls -di /gfs/chroot/usr/
15077981 /gfs/chroot/usr/
Glock (inode[2], 15077981)
gl_flags =
gl_count = 4
gl_state = shared[3]
req_gh = no
req_bh = no
lvb_count = 0
object = yes
new_le = no
incore_le = no
reclaim = yes
aspace = 1
ail_bufs = no
Inode:
num = 15077981/15077981
type = directory[2]
i_count = 1
i_flags =
vnode = yes
Glock (iopen[5], 15077981)
gl_flags =
gl_count = 2
gl_state = shared[3]
req_gh = no
req_bh = no
lvb_count = 0
object = yes
new_le = no
incore_le = no
reclaim = no
aspace = no
ail_bufs = no
Holder
owner = none[-1]
gh_state = shared[3]
gh_flags = local_excl[5] exact[7]
error = 0
gh_iflags = promote[1] holder[6] first[7]
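To avoid reading the whole dump by hand, the entries that have waiters can be
summarised with a short script. This is only a sketch under the assumption
that the deciphered dump uses the "Glock", "Holder", "Waiter..." and
"owner = PID" lines exactly as shown above; the function name
glocks_with_waiters is hypothetical.

```shell
# For each Glock entry that has at least one waiter, print the glock
# together with the PIDs of its holders and waiters.
#   $1 = file holding the deciphered lock dump
glocks_with_waiters() {
    awk '
        function emit() {
            # Report only glocks that somebody is queued on.
            if (glock != "" && waiters != "")
                print glock " holders:" holders " waiters:" waiters
        }
        /^[ \t]*Glock \(/ {
            emit()
            glock = $0
            sub(/^[ \t]+/, "", glock)
            holders = ""; waiters = ""; role = ""
            next
        }
        /^[ \t]*Holder/  { role = "h"; next }
        /^[ \t]*Waiter/  { role = "w"; next }
        /^[ \t]*Request/ { role = "";  next }
        /^[ \t]*owner = / {
            # The owner line belongs to the section header seen just before it.
            if (role == "h") holders = holders " " $3
            if (role == "w") waiters = waiters " " $3
            role = ""
        }
        END { emit() }
    ' "$1"
}

# Usage (file name is only an example):
#   glocks_with_waiters lockdump.deciphered
```

The result shows who holds a contended glock and who is queued behind it; it
does not explain why the holder never releases it.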
Your comments will be highly appreciated.
--
Best Regards,
Anton Kornev.
-- Linux-cluster@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/linux-cluster