Re: gfs2 blocking tasks

Hi Steven

Shared storage is iSCSI:

<totem rrp_mode="none" secauth="off" token="20000"/>
<quorumd tko="4" interval="2" device="/dev/mapper/iscsi_cluster_quorum"></quorumd>

Actually, I now know why this is happening and can reproduce it 100% of the time. I've added my findings as a comment to this bug, reported by somebody having the same problem:

https://bugs.launchpad.net/ubuntu/+source/gfs2-utils/+bug/1020207

Create a file on one node's GFS2 mount, then rm it on the other -> hanging mountpoint plus a kernel oops.
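For completeness, the reproducer boils down to something like this (the node names, ssh access and the exact mount point are my assumptions, not taken verbatim from the thread):

```shell
#!/bin/sh
# Sketch of the reproducer described above: create a file on one
# cluster node's GFS2 mount, then unlink it from the other node.
# NODE1/NODE2 are hypothetical placeholders for the two cluster nodes.
NODE1=vm01
NODE2=vm02
MNT=/var/lib/libvirt/sanlock   # any shared GFS2 mountpoint

ssh "$NODE1" "touch $MNT/testfile"
ssh "$NODE2" "rm $MNT/testfile"
# On the affected kernel, the rm on the second node triggers the NULL
# dereference in gfs2_unlink() and the mountpoint hangs.
```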

Happy that I'm finally getting somewhere with this :P

Anything I can do to help, Steven?

Kind regards,

Bart Verwilst

Steven Whitehouse schreef op 21.08.2012 12:17:
Hi,

On Tue, 2012-08-21 at 12:08 +0200, Bart Verwilst wrote:
As yet another reply to my own post, I found this on the node where it hangs (this time it's vm01, and /var/lib/libvirt/sanlock that's hanging):


[ 1219.640653] GFS2: fsid=: Trying to join cluster "lock_dlm", "kvm:sanlock"
[ 1219.660035] GFS2: fsid=kvm:sanlock.0: Joined cluster. Now mounting FS...
[ 1219.720108] GFS2: fsid=kvm:sanlock.0: jid=0, already locked for use
[ 1219.720113] GFS2: fsid=kvm:sanlock.0: jid=0: Looking at journal...
[ 1219.772606] GFS2: fsid=kvm:sanlock.0: jid=0: Acquiring the transaction lock...
[ 1219.772659] GFS2: fsid=kvm:sanlock.0: jid=0: Replaying journal...
[ 1219.772675] GFS2: fsid=kvm:sanlock.0: jid=0: Replayed 0 of 0 blocks
[ 1219.772679] GFS2: fsid=kvm:sanlock.0: jid=0: Found 1 revoke tags
[ 1219.782611] init: libvirt-bin main process ended, respawning
[ 1219.784161] GFS2: fsid=kvm:sanlock.0: jid=0: Journal replayed in 1s
[ 1219.784268] GFS2: fsid=kvm:sanlock.0: jid=0: Done
[ 1219.784329] GFS2: fsid=kvm:sanlock.0: jid=1: Trying to acquire journal lock...
[ 1219.788349] GFS2: fsid=kvm:sanlock.0: jid=1: Looking at journal...
[ 1219.886047] GFS2: fsid=kvm:sanlock.0: jid=1: Done
[ 1219.886110] GFS2: fsid=kvm:sanlock.0: jid=2: Trying to acquire journal lock...
[ 1219.891121] GFS2: fsid=kvm:sanlock.0: jid=2: Looking at journal...
[ 1219.943994] init: ttyS1 main process (20318) terminated with status 1
[ 1219.944037] init: ttyS1 main process ended, respawning
[ 1219.967054] init: ttyS0 main process (20320) terminated with status 1
[ 1219.967100] init: ttyS0 main process ended, respawning
[ 1219.972037] ttyS0: LSR safety check engaged!
[ 1220.226027] GFS2: fsid=kvm:sanlock.0: jid=2: Acquiring the transaction lock...
[ 1220.226160] GFS2: fsid=kvm:sanlock.0: jid=2: Replaying journal...
[ 1220.311801] GFS2: fsid=kvm:sanlock.0: jid=2: Replayed 3 of 106 blocks
[ 1220.311805] GFS2: fsid=kvm:sanlock.0: jid=2: Found 4487 revoke tags
[ 1220.322148] GFS2: fsid=kvm:sanlock.0: jid=2: Journal replayed in 1s
[ 1220.322253] GFS2: fsid=kvm:sanlock.0: jid=2: Done

So that looks like it successfully recovered the journals for nodes one and two. How many nodes are in the cluster? What fencing/quorum setup is being used?

[ 1221.457120] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018

So this is a dereference of something which is 24 bytes into some
structure or other. Certainly something which should not happen, so we
need to take a look at that.

Was this a one off, or something that you can reproduce?

Steve.


[ 1221.457508] IP: [<ffffffffa04f800a>] gfs2_unlink+0x8a/0x220 [gfs2]
[ 1221.457958] PGD 2170f2067 PUD 20c0ff067 PMD 0
[ 1221.458197] Oops: 0000 [#1] SMP
[ 1221.458374] CPU 0
[ 1221.458460] Modules linked in: gfs2 dlm ipmi_si mptctl mptbase
ipmi_devintf ipmi_msghandler dell_rbu ip6table_filter ip6_tables
iptable_filter ip_tables ebtable_nat ebtables x_tables kvm_intel kvm
ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs dm_round_robin ib_iser bridge rdma_cm ib_cm iw_cm ib_sa ib_mad
ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
bonding nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc 8021q garp stp
joydev dm_multipath dcdbas mac_hid i7core_edac edac_core
acpi_power_meter lp parport usbhid hid bnx2 mpt2sas scsi_transport_sas
e1000e raid_class scsi_dh_rdac usb_storage [last unloaded: ipmi_si]
[ 1221.463058]
[ 1221.463146] Pid: 20611, comm: libvirtd Not tainted 3.2.0-26-generic #41-Ubuntu Dell Inc. PowerEdge R310/05XKKK
[ 1221.463626] RIP: 0010:[<ffffffffa04f800a>]  [<ffffffffa04f800a>] gfs2_unlink+0x8a/0x220 [gfs2]
[ 1221.464100] RSP: 0018:ffff88020cfe1d28  EFLAGS: 00010296
[ 1221.464344] RAX: 0000000000000000 RBX: ffff88021ef58000 RCX: ffff88020cfe1d40
[ 1221.464662] RDX: 0000000000000000 RSI: 00000000000183f3 RDI: ffff88022efa2440
[ 1221.464979] RBP: ffff88020cfe1e38 R08: 4000000000000000 R09: 0000000000000000
[ 1221.465297] R10: fdd50265775f720a R11: 0000000000000003 R12: ffff88021ef50000
[ 1221.465615] R13: ffff88020cfe1d80 R14: ffff8802178980c0 R15: ffff88022efa2000
[ 1221.466115] FS:  00007f4d2c0e7700(0000) GS:ffff880237200000(0000) knlGS:0000000000000000
[ 1221.466487] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1221.466748] CR2: 0000000000000018 CR3: 00000002175e4000 CR4: 00000000000006f0
[ 1221.467066] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1221.467384] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1221.467702] Process libvirtd (pid: 20611, threadinfo ffff88020cfe0000, task ffff8802241cdbc0)
[ 1221.468091] Stack:
[ 1221.468199]  0000000000000003 ffff88022f108048 ffff88020cfe1d58
ffff88020cfe1d40
[ 1221.468581]  ffff88020cfe1d40 ffff88022f108000 ffff88022354aa00
0000000000000001
[ 1221.468963]  0000000000000000 0000000000000000 ffffffffa04f7fda
ffff88020cfe1d80
[ 1221.469346] Call Trace:
[ 1221.469486]  [<ffffffffa04f7fda>] ? gfs2_unlink+0x5a/0x220 [gfs2]
[ 1221.469955]  [<ffffffffa04f7ff4>] ? gfs2_unlink+0x74/0x220 [gfs2]
[ 1221.470236]  [<ffffffff8129cb2c>] ? security_inode_permission+0x1c/0x30
[ 1221.470536]  [<ffffffff81184e70>] vfs_unlink.part.26+0x80/0xf0
[ 1221.470802]  [<ffffffff81184f1c>] vfs_unlink+0x3c/0x60
[ 1221.471040]  [<ffffffff8118758a>] do_unlinkat+0x1aa/0x1d0
[ 1221.471290]  [<ffffffff81177fc0>] ? vfs_write+0x110/0x180
[ 1221.471538]  [<ffffffff811782a7>] ? sys_write+0x67/0x90
[ 1221.471780]  [<ffffffff81188106>] sys_unlink+0x16/0x20
[ 1221.472019]  [<ffffffff81661fc2>] system_call_fastpath+0x16/0x1b
[ 1221.472290] Code: 00 00 49 83 c5 40 31 d2 4c 89 e9 be 01 00 00 00 e8 fc 1e ff ff 48 8b b3 28 02 00 00 4c 89 ff e8 ad 7e 00 00 48 8d 8d 08 ff ff ff <48> 8b 78 18 31 d2 be 01 00 00 00 48 83 e9 80 e8 d2 1e ff ff 48
[ 1221.473936] RIP [<ffffffffa04f800a>] gfs2_unlink+0x8a/0x220 [gfs2]
[ 1221.474240]  RSP <ffff88020cfe1d28>
[ 1221.474408] CR2: 0000000000000018
[ 1221.474959] ---[ end trace f7df780fd45600a8 ]---


Bart Verwilst schreef op 20.08.2012 09:50:
> Nothing out of the ordinary, should have mentioned that!
>
> <snip>
> Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] daemon: read 20
> bytes from fd 17
> Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] daemon: client
> command is 7
> Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] daemon: About to
> process command
> Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] memb: command to
> process is 7
> Aug 19 00:08:00 vm02-test corosync[7394]:   [CMAN  ] memb:
> get_all_members: allocated new buffer (retsize=1024)
> Aug 19 00:08:00 vm02-test corosync[7394]:   [CMAN  ] memb:
> get_all_members: retlen = 1760
> Aug 19 00:08:00 vm02-test corosync[7394]:   [CMAN  ] memb: command
> return code is 4
> Aug 19 00:08:00 vm02-test corosync[7394]:   [CMAN  ] daemon:
> Returning command data. length = 1760
> Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] daemon: sending
> reply 40000007 to fd 17
> Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] daemon: read 20
> bytes from fd 17
> Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] daemon: client
> command is 800000b7
> Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] daemon: About to
> process command
> Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] memb: command to
> process is 800000b7
> Aug 19 00:08:00 vm02-test corosync[7394]:   [CMAN  ] memb: command
> return code is 0
> Aug 19 00:08:00 vm02-test corosync[7394]:   [CMAN  ] daemon:
> Returning command data. length = 0
> Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] daemon: sending
> reply c00000b7 to fd 17
> Aug 19 00:08:01 vm02-test corosync[7394]: [CMAN ] daemon: read 20
> bytes from fd 17
> Aug 19 00:08:01 vm02-test corosync[7394]: [CMAN ] daemon: client
> command is 7
> Aug 19 00:08:01 vm02-test corosync[7394]: [CMAN ] daemon: About to
> process command
> </snip>
>
> Digimer schreef op 20.08.2012 00:01:
>> On 08/19/2012 05:34 PM, Bart Verwilst wrote:
>>> Aug 19 00:10:01 vm02-test kernel: [282120.240067] INFO: task
>>> kworker/1:0:3117 blocked for more than 120 seconds.
>>> Aug 19 00:10:01 vm02-test kernel: [282120.240182] "echo 0 >
>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>
>> Nothing around Aug 19 00:08:00 ?

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster


