Wendy Cheng wrote:
Mark Hlawatschek wrote:
Hi Wendy,
thanks for your answer!
The system is still in the deadlock state, so hopefully I can collect
all the information you need :-) (you'll find the crash output below)
Thanks,
So it is removing a file. It has obtained the directory lock and is
waiting for the file lock. It looks to me like the DLM (LM_CB_ASYNC)
callback never occurs. Do you have any abnormal messages in your
/var/log/messages file?
Dave, how do we dump the locks from the DLM side to see what DLM is thinking?
Sorry, stepped out for lunch - was hoping Dave would take over this :)
... anyway, please dump the DLM locks as follows:
shell> cman_tool services    # find your lock space name
shell> echo "lock-space-name-found-above" > /proc/cluster/dlm_locks
shell> cat /proc/cluster/dlm_locks
Then try to find the lock (2, hex of 1114343) and cut and paste the
relevant portion of that file here.
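For example (just a sketch - the exact formatting of the resource names
inside /proc/cluster/dlm_locks may differ slightly, so adjust the grep
as needed):

shell> printf '%x\n' 1114343    # inode number in hex -> 1100e7
shell> grep -i -A 20 '1100e7' /proc/cluster/dlm_locks

The first command just converts the inode number to hex; the second pulls
out the matching resource plus a bit of context.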
-- Wendy
We have the following deadlock situation:
A 2-node cluster consisting of node1 and node2.
/usr/local is placed on a GFS filesystem mounted on both nodes.
The lock manager is DLM.
We are using RHEL4u4.
An strace of ls -l /usr/local/swadmin/mnx/xml hangs in
lstat("/usr/local/swadmin/mnx/xml",
This happens on both cluster nodes.
All processes trying to access the directory
/usr/local/swadmin/mnx/xml
are in "Waiting for IO (D)" state, i.e. system load is at about 400
;-)
Any ideas?
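(For what it's worth, this is roughly how we see the blocked processes -
just a sketch, the exact ps columns may need adjusting on your system:)

shell> ps axo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'    # list D-state tasks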
Quickly browsing this, it looks to me like the process with pid=5856 got
stuck. That process held the exclusive lock on the file or directory
(inode number 627732 - probably /usr/local/swadmin/mnx/xml), so everyone
else was waiting for it. The faulty process was apparently in the middle
of obtaining another exclusive lock (and almost got it). We need to know
where pid=5856 was stuck at that time. If this occurs again, could you
use "crash" to back-trace that process and show us the output?
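On a live system something like the following should do it - assuming the
matching kernel-debuginfo vmlinux is installed (the path below is the
usual location, but it may differ on your box):

shell> crash /usr/lib/debug/lib/modules/`uname -r`/vmlinux
crash> bt 5856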
Here's the crash output:
crash> bt 5856
PID: 5856 TASK: 10bd26427f0 CPU: 0 COMMAND: "java"
#0 [10bd20cfbc8] schedule at ffffffff8030a1d1
#1 [10bd20cfca0] wait_for_completion at ffffffff8030a415
#2 [10bd20cfd20] glock_wait_internal at ffffffffa018574e
#3 [10bd20cfd60] gfs_glock_nq_m at ffffffffa01860ce
#4 [10bd20cfda0] gfs_unlink at ffffffffa019ce41
#5 [10bd20cfea0] vfs_unlink at ffffffff801889fa
#6 [10bd20cfed0] sys_unlink at ffffffff80188b19
#7 [10bd20cff30] filp_close at ffffffff80178e48
#8 [10bd20cff50] error_exit at ffffffff80110d91
RIP: 0000002a9593f649 RSP: 0000007fbfffbca0 RFLAGS: 00010206
RAX: 0000000000000057 RBX: ffffffff8011026a RCX: 0000002a9cc9c870
RDX: 0000002ae5989000 RSI: 0000002a962fa3a8 RDI: 0000002ae5989000
RBP: 0000000000000000 R8: 0000002a9630abb0 R9: 0000000000000ffc
R10: 0000002a9630abc0 R11: 0000000000000206 R12: 0000000040115700
R13: 0000002ae23294b0 R14: 0000007fbfffc300 R15: 0000002ae5989000
ORIG_RAX: 0000000000000057 CS: 0033 SS: 002b
A lockdump analysis with decipher_lockstate_dump and parse_lockdump
shows the following output (the whole file is too large for the
mailing list):
Entries: 101939
Glocks: 60112
PIDs: 751
4 chain:
lockdump.node1.dec Glock (inode[2], 1114343)
gl_flags = lock[1]
gl_count = 5
gl_state = shared[3]
req_gh = yes
req_bh = yes
lvb_count = 0
object = yes
new_le = no
incore_le = no
reclaim = no
aspace = 1
ail_bufs = no
Request
owner = 5856
gh_state = exclusive[1]
gh_flags = try[0] local_excl[5] async[6]
error = 0
gh_iflags = promote[1]
Waiter3
owner = 5856
gh_state = exclusive[1]
gh_flags = try[0] local_excl[5] async[6]
error = 0
gh_iflags = promote[1]
Inode: busy
lockdump.node2.dec Glock (inode[2], 1114343)
gl_flags =
gl_count = 2
gl_state = unlocked[0]
req_gh = no
req_bh = no
lvb_count = 0
object = yes
new_le = no
incore_le = no
reclaim = no
aspace = 0
ail_bufs = no
Inode:
num = 1114343/1114343
type = regular[1]
i_count = 1
i_flags =
vnode = yes
lockdump.node1.dec Glock (inode[2], 627732)
gl_flags = dirty[5]
gl_count = 379
gl_state = exclusive[1]
req_gh = no
req_bh = no
lvb_count = 0
object = yes
new_le = no
incore_le = no
reclaim = no
aspace = 58
ail_bufs = no
Holder
owner = 5856
gh_state = exclusive[1]
gh_flags = try[0] local_excl[5] async[6]
error = 0
gh_iflags = promote[1] holder[6] first[7]
Waiter2
owner = none[-1]
gh_state = shared[3]
gh_flags = try[0]
error = 0
gh_iflags = demote[2] alloced[4] dealloc[5]
Waiter3
owner = 32753
gh_state = shared[3]
gh_flags = any[3]
error = 0
gh_iflags = promote[1]
[...loads of Waiter3 entries...]
Waiter3
owner = 4566
gh_state = shared[3]
gh_flags = any[3]
error = 0
gh_iflags = promote[1]
Inode: busy
lockdump.node2.dec Glock (inode[2], 627732)
gl_flags = lock[1]
gl_count = 375
gl_state = unlocked[0]
req_gh = yes
req_bh = yes
lvb_count = 0
object = yes
new_le = no
incore_le = no
reclaim = no
aspace = 0
ail_bufs = no
Request
owner = 20187
gh_state = shared[3]
gh_flags = any[3]
error = 0
gh_iflags = promote[1]
Waiter3
owner = 20187
gh_state = shared[3]
gh_flags = any[3]
error = 0
gh_iflags = promote[1]
[...loads of Waiter3 entries...]
Waiter3
owner = 10460
gh_state = shared[3]
gh_flags = any[3]
error = 0
gh_iflags = promote[1]
Inode: busy
2 requests
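(For completeness, the decoded dumps above were generated roughly like
this - the tool names are as shipped with GFS, but the exact
parse_lockdump invocation is from memory, so check the script's usage:)

shell> gfs_tool lockdump /usr/local > lockdump.node1    # on node1
shell> gfs_tool lockdump /usr/local > lockdump.node2    # on node2
shell> decipher_lockstate_dump lockdump.node1 > lockdump.node1.dec
shell> decipher_lockstate_dump lockdump.node2 > lockdump.node2.dec
shell> parse_lockdump lockdump.node1.dec lockdump.node2.dec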
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster