Mark Hlawatschek wrote:
Hi,
we have the following deadlock situation:
A 2-node cluster consisting of node1 and node2.
/usr/local is placed on a GFS filesystem mounted on both nodes.
The lock manager is DLM.
We are using RHEL4u4.
An strace of ls -l /usr/local/swadmin/mnx/xml hangs in:
lstat("/usr/local/swadmin/mnx/xml",
This happens on both cluster nodes.
All processes trying to access the directory /usr/local/swadmin/mnx/xml are
in the "Waiting for IO (D)" state, so the system load is at about 400 ;-)
Any ideas?
From a quick look at this, it seems that the process with pid=5856 got stuck.
That process held the exclusive lock on the file or directory (inode number
627732, probably /usr/local/swadmin/mnx/xml), so everyone was waiting for
it. The faulty process was apparently in the middle of obtaining another
exclusive lock (and almost got it). We need to know where pid=5856 was
stuck at that time. If this occurs again, could you use "crash" to
backtrace that process and show us the output?
-- Wendy
A lockdump analysis with decipher_lockstate_dump and parse_lockdump shows
the following output (the whole file is too large for the mailing list):
Entries: 101939
Glocks: 60112
PIDs: 751
4 chain:
lockdump.node1.dec Glock (inode[2], 1114343)
gl_flags = lock[1]
gl_count = 5
gl_state = shared[3]
req_gh = yes
req_bh = yes
lvb_count = 0
object = yes
new_le = no
incore_le = no
reclaim = no
aspace = 1
ail_bufs = no
Request
owner = 5856
gh_state = exclusive[1]
gh_flags = try[0] local_excl[5] async[6]
error = 0
gh_iflags = promote[1]
Waiter3
owner = 5856
gh_state = exclusive[1]
gh_flags = try[0] local_excl[5] async[6]
error = 0
gh_iflags = promote[1]
Inode: busy
lockdump.node2.dec Glock (inode[2], 1114343)
gl_flags =
gl_count = 2
gl_state = unlocked[0]
req_gh = no
req_bh = no
lvb_count = 0
object = yes
new_le = no
incore_le = no
reclaim = no
aspace = 0
ail_bufs = no
Inode:
num = 1114343/1114343
type = regular[1]
i_count = 1
i_flags =
vnode = yes
lockdump.node1.dec Glock (inode[2], 627732)
gl_flags = dirty[5]
gl_count = 379
gl_state = exclusive[1]
req_gh = no
req_bh = no
lvb_count = 0
object = yes
new_le = no
incore_le = no
reclaim = no
aspace = 58
ail_bufs = no
Holder
owner = 5856
gh_state = exclusive[1]
gh_flags = try[0] local_excl[5] async[6]
error = 0
gh_iflags = promote[1] holder[6] first[7]
Waiter2
owner = none[-1]
gh_state = shared[3]
gh_flags = try[0]
error = 0
gh_iflags = demote[2] alloced[4] dealloc[5]
Waiter3
owner = 32753
gh_state = shared[3]
gh_flags = any[3]
error = 0
gh_iflags = promote[1]
[...loads of Waiter3 entries...]
Waiter3
owner = 4566
gh_state = shared[3]
gh_flags = any[3]
error = 0
gh_iflags = promote[1]
Inode: busy
lockdump.node2.dec Glock (inode[2], 627732)
gl_flags = lock[1]
gl_count = 375
gl_state = unlocked[0]
req_gh = yes
req_bh = yes
lvb_count = 0
object = yes
new_le = no
incore_le = no
reclaim = no
aspace = 0
ail_bufs = no
Request
owner = 20187
gh_state = shared[3]
gh_flags = any[3]
error = 0
gh_iflags = promote[1]
Waiter3
owner = 20187
gh_state = shared[3]
gh_flags = any[3]
error = 0
gh_iflags = promote[1]
[...loads of Waiter3 entries...]
Waiter3
owner = 10460
gh_state = shared[3]
gh_flags = any[3]
error = 0
gh_iflags = promote[1]
Inode: busy
2 requests
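
For anyone who wants to look for the same pattern in their own dumps: the interesting process in the output above is the one that appears as Holder on one glock while also appearing as Request or Waiter on another (here pid 5856, holding 627732 and waiting on 1114343). Below is a minimal Python sketch that scans a deciphered dump for such PIDs. It is not the parse_lockdump tool mentioned above; the function names are made up for illustration, and it assumes the text layout shown in this post (a "Glock (type[n], number)" line, section names such as Holder, Request or Waiter3 on their own line, and "owner = PID" lines below them).

```python
import re
from collections import defaultdict

# Assumed layout, taken from the excerpt in this post:
#   "<dumpfile> Glock (inode[2], 627732)" starts a glock,
#   "Holder" / "Request" / "WaiterN" start a section,
#   "owner = <pid>" names the process for that section.
GLOCK_RE = re.compile(r"^(?:(\S+)\s+)?Glock \((\w+)\[\d+\], (\d+)\)")
SECTION_RE = re.compile(r"^(Holder|Request|Waiter\d+)$")
OWNER_RE = re.compile(r"^owner = (\d+)$")  # skips "owner = none[-1]"


def scan_decoded_dump(lines):
    """Return (holds, waits): dicts mapping pid -> set of glock keys.

    A glock key is (dump file name, glock type, glock number).
    """
    holds = defaultdict(set)
    waits = defaultdict(set)
    glock = None
    section = None
    for raw in lines:
        line = raw.strip()
        m = GLOCK_RE.match(line)
        if m:
            glock = (m.group(1) or "", m.group(2), int(m.group(3)))
            section = None
            continue
        m = SECTION_RE.match(line)
        if m:
            section = m.group(1)
            continue
        if line.startswith("Inode:"):
            # The Inode block ends the holder/waiter sections of a glock.
            section = None
            continue
        m = OWNER_RE.match(line)
        if m and glock is not None and section is not None:
            pid = int(m.group(1))
            target = holds if section == "Holder" else waits
            target[pid].add(glock)
    return holds, waits


def blocked_holders(holds, waits):
    """Return {pid: (held glocks, glocks still waited for)}.

    Only PIDs that hold at least one glock while waiting for a
    different one are reported; these are the deadlock suspects.
    """
    result = {}
    for pid, held in holds.items():
        pending = waits.get(pid, set()) - held
        if pending:
            result[pid] = (held, pending)
    return result
```

To use it, read the deciphered dump files for all nodes into one list of lines, pass that to scan_decoded_dump(), and hand the two results to blocked_holders(). A PID that shows up there is only a suspect; a backtrace of that process is still needed to see where it is stuck.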