Wendy Cheng wrote:
Mark Hlawatschek wrote:
Hi,
we have the following deadlock situation:
A 2-node cluster consisting of node1 and node2. /usr/local is placed
on a GFS filesystem mounted on both nodes. The lock manager is dlm.
We are using RHEL4u4.
An strace of ls -l /usr/local/swadmin/mnx/xml ends up hanging in
lstat("/usr/local/swadmin/mnx/xml",
This happens on both cluster nodes.
All processes trying to access the directory
/usr/local/swadmin/mnx/xml are in "Waiting for IO (D)" state, i.e. the
system load is at about 400 ;-)
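For reference, the blocked processes and the kernel function they are
sleeping in can be listed with something like this (plain procps,
nothing GFS-specific):

  # show pid, state, wait channel and command of every process that is
  # currently in uninterruptible sleep (state D)
  ps axo pid,stat,wchan,cmd | awk '$2 ~ /^D/'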
Any ideas?
Quickly browsing this, it looks to me like the process with pid=5856
got stuck. That process held the exclusive lock on the file or
directory (inode number 627732 - probably /usr/local/swadmin/mnx/xml),
so everyone else was waiting for it. The faulty process was apparently
in the middle of obtaining another exclusive lock (and had almost got
it). We need to know where pid=5856 was stuck at that time. If this
occurs again, could you use "crash" to backtrace that process and show
us the output?
Or an "echo t > /proc/sysrq-trigger" to obtain *all* threads backtrace
would be better - but it has the risk of missing heartbeat that could
result cluster fence action since sysrq-t could stall the system for a
while.
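Something along these lines should work (assuming the matching
kernel-debuginfo package is installed so crash has a vmlinux with
symbols - the path below is the usual RHEL4 location):

  # attach crash to the running kernel
  crash /usr/lib/debug/lib/modules/`uname -r`/vmlinux
  # then, at the crash> prompt, dump the kernel stack of the stuck pid
  crash> bt 5856

The sysrq-t output ends up in the kernel log (dmesg and
/var/log/messages), so please capture that too if you go that route.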
-- Wendy
A lockdump analysis with decipher_lockstate_dump and parse_lockdump
shows the following output (the whole file is too large for the
mailing list):
Entries: 101939
Glocks: 60112
PIDs: 751
4 chain:
lockdump.node1.dec Glock (inode[2], 1114343)
gl_flags = lock[1]
gl_count = 5
gl_state = shared[3]
req_gh = yes
req_bh = yes
lvb_count = 0
object = yes
new_le = no
incore_le = no
reclaim = no
aspace = 1
ail_bufs = no
Request
owner = 5856
gh_state = exclusive[1]
gh_flags = try[0] local_excl[5] async[6]
error = 0
gh_iflags = promote[1]
Waiter3
owner = 5856
gh_state = exclusive[1]
gh_flags = try[0] local_excl[5] async[6]
error = 0
gh_iflags = promote[1]
Inode: busy
lockdump.node2.dec Glock (inode[2], 1114343)
gl_flags =
gl_count = 2
gl_state = unlocked[0]
req_gh = no
req_bh = no
lvb_count = 0
object = yes
new_le = no
incore_le = no
reclaim = no
aspace = 0
ail_bufs = no
Inode:
num = 1114343/1114343
type = regular[1]
i_count = 1
i_flags =
vnode = yes
lockdump.node1.dec Glock (inode[2], 627732)
gl_flags = dirty[5]
gl_count = 379
gl_state = exclusive[1]
req_gh = no
req_bh = no
lvb_count = 0
object = yes
new_le = no
incore_le = no
reclaim = no
aspace = 58
ail_bufs = no
Holder
owner = 5856
gh_state = exclusive[1]
gh_flags = try[0] local_excl[5] async[6]
error = 0
gh_iflags = promote[1] holder[6] first[7]
Waiter2
owner = none[-1]
gh_state = shared[3]
gh_flags = try[0]
error = 0
gh_iflags = demote[2] alloced[4] dealloc[5]
Waiter3
owner = 32753
gh_state = shared[3]
gh_flags = any[3]
error = 0
gh_iflags = promote[1]
[...loads of Waiter3 entries...]
Waiter3
owner = 4566
gh_state = shared[3]
gh_flags = any[3]
error = 0
gh_iflags = promote[1]
Inode: busy
lockdump.node2.dec Glock (inode[2], 627732)
gl_flags = lock[1]
gl_count = 375
gl_state = unlocked[0]
req_gh = yes
req_bh = yes
lvb_count = 0
object = yes
new_le = no
incore_le = no
reclaim = no
aspace = 0
ail_bufs = no
Request
owner = 20187
gh_state = shared[3]
gh_flags = any[3]
error = 0
gh_iflags = promote[1]
Waiter3
owner = 20187
gh_state = shared[3]
gh_flags = any[3]
error = 0
gh_iflags = promote[1]
[...loads of Waiter3 entries...]
Waiter3
owner = 10460
gh_state = shared[3]
gh_flags = any[3]
error = 0
gh_iflags = promote[1]
Inode: busy
2 requests
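For reference, the dumps were taken and decoded roughly as follows
(the lockdump is taken on each node; the exact location and invocation
of the helper scripts may differ on your installation):

  # dump the glock state of the mounted GFS filesystem (on each node)
  gfs_tool lockdump /usr/local > lockdump.node1
  # decode the raw dumps ...
  decipher_lockstate_dump lockdump.node1 > lockdump.node1.dec
  decipher_lockstate_dump lockdump.node2 > lockdump.node2.dec
  # ... and search both decoded dumps for blocked lock chains
  parse_lockdump lockdump.node1.dec lockdump.node2.dec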
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster