Hi,

On Tue, 2012-08-21 at 13:03 +0200, Bart Verwilst wrote:
> Hi Steven,
>
> There is no drbd in the mix (which is why I changed the title of the
> bug report now). I'm only using plain iSCSI. The original poster had
> it with drbd :)
>
> Kind regards,
>
> Bart
>
Ah, I see, sorry... I misunderstood the report. I wonder whether your
distro kernel has this patch:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=718b97bd6b03445be53098e3c8f896aeebc304aa

That's the most likely thing that I can see that has been fixed
recently.
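In the meantime, for anyone wanting to try this on a test cluster: the
reproducer Bart describes below (create a file on one node's gfs2
mount, rm it on the other) comes down to a create on node A followed
by an unlink on node B, and the unlink is what enters gfs2_unlink()
per the call trace. A minimal sketch of that sequence as a standalone
program (this "repro.c" is purely illustrative, not something from the
bug report):

/* repro.c - sketch of the two-node reproducer described below.
 * Build on both nodes, then run against the shared gfs2 mount:
 *   node A: ./repro create /mnt/gfs2/testfile
 *   node B: ./repro unlink /mnt/gfs2/testfile
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s create|unlink <path>\n", argv[0]);
		return 1;
	}

	if (strcmp(argv[1], "create") == 0) {
		/* node A: create the test file on the shared mount */
		int fd = open(argv[2], O_CREAT | O_WRONLY, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		close(fd);
	} else if (strcmp(argv[1], "unlink") == 0) {
		/* node B: unlink it; this is the path that ends in
		 * sys_unlink -> vfs_unlink -> gfs2_unlink in the
		 * call trace quoted below */
		if (unlink(argv[2]) < 0) {
			perror("unlink");
			return 1;
		}
	} else {
		fprintf(stderr, "unknown mode: %s\n", argv[1]);
		return 1;
	}
	return 0;
}

Steve.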
> Steven Whitehouse schreef op 21.08.2012 12:59:
> > Hi,
> >
> > On Tue, 2012-08-21 at 12:39 +0200, Bart Verwilst wrote:
> >> Hi Steven
> >>
> >> Shared storage is iSCSI,
> >>
> >> <totem rrp_mode="none" secauth="off" token="20000"/>
> >> <quorumd tko="4" interval="2"
> >>          device="/dev/mapper/iscsi_cluster_quorum"></quorumd>
> >>
> >> Actually I know why this is happening now, and can reproduce it
> >> 100% of the time. I've added my findings as a comment to this bug
> >> from somebody having the same problem:
> >>
> >> https://bugs.launchpad.net/ubuntu/+source/gfs2-utils/+bug/1020207
> >>
> >> Create a file on one node's gfs2 mount, rm it on the other ->
> >> hanging mountpoint + kernel OOPS.
> >>
> >> Happy that I'm finally getting somewhere with this :P
> >>
> >> Anything I can do to help, Steven?
> >>
> >> Kind regards,
> >>
> >> Bart Verwilst
> >>
> > Can you reproduce this without drbd in the mix? That should remove
> > one complication and make this easier to track down.
> >
> > I'll take a look and see what that dereference is likely to be in
> > the meantime,
> >
> > Steve.
> >
> >> Steven Whitehouse schreef op 21.08.2012 12:17:
> >> > Hi,
> >> >
> >> > On Tue, 2012-08-21 at 12:08 +0200, Bart Verwilst wrote:
> >> >> As yet another reply to my own post, I found this on the node
> >> >> where it hangs (this time it's vm01, and
> >> >> /var/lib/libvirt/sanlock that's hanging):
> >> >>
> >> >> [ 1219.640653] GFS2: fsid=: Trying to join cluster "lock_dlm", "kvm:sanlock"
> >> >> [ 1219.660035] GFS2: fsid=kvm:sanlock.0: Joined cluster. Now mounting FS...
> >> >> [ 1219.720108] GFS2: fsid=kvm:sanlock.0: jid=0, already locked for use
> >> >> [ 1219.720113] GFS2: fsid=kvm:sanlock.0: jid=0: Looking at journal...
> >> >> [ 1219.772606] GFS2: fsid=kvm:sanlock.0: jid=0: Acquiring the transaction lock...
> >> >> [ 1219.772659] GFS2: fsid=kvm:sanlock.0: jid=0: Replaying journal...
> >> >> [ 1219.772675] GFS2: fsid=kvm:sanlock.0: jid=0: Replayed 0 of 0 blocks
> >> >> [ 1219.772679] GFS2: fsid=kvm:sanlock.0: jid=0: Found 1 revoke tags
> >> >> [ 1219.782611] init: libvirt-bin main process ended, respawning
> >> >> [ 1219.784161] GFS2: fsid=kvm:sanlock.0: jid=0: Journal replayed in 1s
> >> >> [ 1219.784268] GFS2: fsid=kvm:sanlock.0: jid=0: Done
> >> >> [ 1219.784329] GFS2: fsid=kvm:sanlock.0: jid=1: Trying to acquire journal lock...
> >> >> [ 1219.788349] GFS2: fsid=kvm:sanlock.0: jid=1: Looking at journal...
> >> >> [ 1219.886047] GFS2: fsid=kvm:sanlock.0: jid=1: Done
> >> >> [ 1219.886110] GFS2: fsid=kvm:sanlock.0: jid=2: Trying to acquire journal lock...
> >> >> [ 1219.891121] GFS2: fsid=kvm:sanlock.0: jid=2: Looking at journal...
> >> >> [ 1219.943994] init: ttyS1 main process (20318) terminated with status 1
> >> >> [ 1219.944037] init: ttyS1 main process ended, respawning
> >> >> [ 1219.967054] init: ttyS0 main process (20320) terminated with status 1
> >> >> [ 1219.967100] init: ttyS0 main process ended, respawning
> >> >> [ 1219.972037] ttyS0: LSR safety check engaged!
> >> >> [ 1220.226027] GFS2: fsid=kvm:sanlock.0: jid=2: Acquiring the transaction lock...
> >> >> [ 1220.226160] GFS2: fsid=kvm:sanlock.0: jid=2: Replaying journal...
> >> >> [ 1220.311801] GFS2: fsid=kvm:sanlock.0: jid=2: Replayed 3 of 106 blocks
> >> >> [ 1220.311805] GFS2: fsid=kvm:sanlock.0: jid=2: Found 4487 revoke tags
> >> >> [ 1220.322148] GFS2: fsid=kvm:sanlock.0: jid=2: Journal replayed in 1s
> >> >> [ 1220.322253] GFS2: fsid=kvm:sanlock.0: jid=2: Done
> >> >
> >> > So that looks like it successfully recovered the journals for
> >> > nodes one and two. How many nodes are in the cluster? What is
> >> > the fencing/quorum setup being used?
> >> >
> >> >> [ 1221.457120] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
> >> >
> >> > So this is a dereference of something which is 24 bytes into
> >> > some structure or other. Certainly something which should not
> >> > happen, so we need to take a look at that.
> >> >
> >> > Was this a one-off, or something that you can reproduce?
> >> >
> >> > Steve.
> >> >
> >> >> [ 1221.457508] IP: [<ffffffffa04f800a>] gfs2_unlink+0x8a/0x220 [gfs2]
> >> >> [ 1221.457958] PGD 2170f2067 PUD 20c0ff067 PMD 0
> >> >> [ 1221.458197] Oops: 0000 [#1] SMP
> >> >> [ 1221.458374] CPU 0
> >> >> [ 1221.458460] Modules linked in: gfs2 dlm ipmi_si mptctl mptbase ipmi_devintf ipmi_msghandler dell_rbu ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables kvm_intel kvm ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs dm_round_robin ib_iser bridge rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bonding nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc 8021q garp stp joydev dm_multipath dcdbas mac_hid i7core_edac edac_core acpi_power_meter lp parport usbhid hid bnx2 mpt2sas scsi_transport_sas e1000e raid_class scsi_dh_rdac usb_storage [last unloaded: ipmi_si]
> >> >> [ 1221.463058]
> >> >> [ 1221.463146] Pid: 20611, comm: libvirtd Not tainted 3.2.0-26-generic #41-Ubuntu Dell Inc. PowerEdge R310/05XKKK
> >> >> [ 1221.463626] RIP: 0010:[<ffffffffa04f800a>] [<ffffffffa04f800a>] gfs2_unlink+0x8a/0x220 [gfs2]
> >> >> [ 1221.464100] RSP: 0018:ffff88020cfe1d28 EFLAGS: 00010296
> >> >> [ 1221.464344] RAX: 0000000000000000 RBX: ffff88021ef58000 RCX: ffff88020cfe1d40
> >> >> [ 1221.464662] RDX: 0000000000000000 RSI: 00000000000183f3 RDI: ffff88022efa2440
> >> >> [ 1221.464979] RBP: ffff88020cfe1e38 R08: 4000000000000000 R09: 0000000000000000
> >> >> [ 1221.465297] R10: fdd50265775f720a R11: 0000000000000003 R12: ffff88021ef50000
> >> >> [ 1221.465615] R13: ffff88020cfe1d80 R14: ffff8802178980c0 R15: ffff88022efa2000
> >> >> [ 1221.466115] FS: 00007f4d2c0e7700(0000) GS:ffff880237200000(0000) knlGS:0000000000000000
> >> >> [ 1221.466487] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> >> [ 1221.466748] CR2: 0000000000000018 CR3: 00000002175e4000 CR4: 00000000000006f0
> >> >> [ 1221.467066] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >> >> [ 1221.467384] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> >> >> [ 1221.467702] Process libvirtd (pid: 20611, threadinfo ffff88020cfe0000, task ffff8802241cdbc0)
> >> >> [ 1221.468091] Stack:
> >> >> [ 1221.468199]  0000000000000003 ffff88022f108048 ffff88020cfe1d58 ffff88020cfe1d40
> >> >> [ 1221.468581]  ffff88020cfe1d40 ffff88022f108000 ffff88022354aa00 0000000000000001
> >> >> [ 1221.468963]  0000000000000000 0000000000000000 ffffffffa04f7fda ffff88020cfe1d80
> >> >> [ 1221.469346] Call Trace:
> >> >> [ 1221.469486]  [<ffffffffa04f7fda>] ? gfs2_unlink+0x5a/0x220 [gfs2]
> >> >> [ 1221.469955]  [<ffffffffa04f7ff4>] ? gfs2_unlink+0x74/0x220 [gfs2]
> >> >> [ 1221.470236]  [<ffffffff8129cb2c>] ? security_inode_permission+0x1c/0x30
> >> >> [ 1221.470536]  [<ffffffff81184e70>] vfs_unlink.part.26+0x80/0xf0
> >> >> [ 1221.470802]  [<ffffffff81184f1c>] vfs_unlink+0x3c/0x60
> >> >> [ 1221.471040]  [<ffffffff8118758a>] do_unlinkat+0x1aa/0x1d0
> >> >> [ 1221.471290]  [<ffffffff81177fc0>] ? vfs_write+0x110/0x180
> >> >> [ 1221.471538]  [<ffffffff811782a7>] ? sys_write+0x67/0x90
> >> >> [ 1221.471780]  [<ffffffff81188106>] sys_unlink+0x16/0x20
> >> >> [ 1221.472019]  [<ffffffff81661fc2>] system_call_fastpath+0x16/0x1b
> >> >> [ 1221.472290] Code: 00 00 49 83 c5 40 31 d2 4c 89 e9 be 01 00 00 00 e8 fc 1e ff ff 48 8b b3 28 02 00 00 4c 89 ff e8 ad 7e 00 00 48 8d 8d 08 ff ff ff <48> 8b 78 18 31 d2 be 01 00 00 00 48 83 e9 80 e8 d2 1e ff ff 48
> >> >> [ 1221.473936] RIP [<ffffffffa04f800a>] gfs2_unlink+0x8a/0x220 [gfs2]
> >> >> [ 1221.474240] RSP <ffff88020cfe1d28>
> >> >> [ 1221.474408] CR2: 0000000000000018
> >> >> [ 1221.474959] ---[ end trace f7df780fd45600a8 ]---
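For anyone decoding the oops above: CR2 is 0000000000000018 and RAX is
zero in the register dump, and the faulting bytes in the Code: line
(<48> 8b 78 18) disassemble to "mov 0x18(%rax),%rdi". In other words,
something read a field 24 bytes (0x18) into a structure through a NULL
pointer. As a sketch of that failure pattern (the struct and names
here are made up for illustration; this is not the actual gfs2
layout):

/* Illustrative only: the point is the offset arithmetic on x86-64,
 * not the real gfs2 structures. */
struct example {
	unsigned long a;	/* offset 0x00 */
	unsigned long b;	/* offset 0x08 */
	unsigned long c;	/* offset 0x10 */
	void *target;		/* offset 0x18 <- the faulting field */
};

void *get_target(struct example *p)
{
	/* With p == NULL, this load reads address 0 + 0x18 = 0x18,
	 * which is exactly the "NULL pointer dereference at
	 * 0000000000000018" (and CR2) reported above. */
	return p->target;
}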
> >> >>
> >> >> Bart Verwilst schreef op 20.08.2012 09:50:
> >> >> > Nothing out of the ordinary, should have mentioned that!
> >> >> >
> >> >> > <snip>
> >> >> > Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] daemon: read 20 bytes from fd 17
> >> >> > Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] daemon: client command is 7
> >> >> > Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] daemon: About to process command
> >> >> > Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] memb: command to process is 7
> >> >> > Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] memb: get_all_members: allocated new buffer (retsize=1024)
> >> >> > Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] memb: get_all_members: retlen = 1760
> >> >> > Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] memb: command return code is 4
> >> >> > Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] daemon: Returning command data. length = 1760
> >> >> > Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] daemon: sending reply 40000007 to fd 17
> >> >> > Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] daemon: read 20 bytes from fd 17
> >> >> > Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] daemon: client command is 800000b7
> >> >> > Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] daemon: About to process command
> >> >> > Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] memb: command to process is 800000b7
> >> >> > Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] memb: command return code is 0
> >> >> > Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] daemon: Returning command data. length = 0
> >> >> > Aug 19 00:08:00 vm02-test corosync[7394]: [CMAN ] daemon: sending reply c00000b7 to fd 17
> >> >> > Aug 19 00:08:01 vm02-test corosync[7394]: [CMAN ] daemon: read 20 bytes from fd 17
> >> >> > Aug 19 00:08:01 vm02-test corosync[7394]: [CMAN ] daemon: client command is 7
> >> >> > Aug 19 00:08:01 vm02-test corosync[7394]: [CMAN ] daemon: About to process command
> >> >> > </snip>
> >> >> >
> >> >> > Digimer schreef op 20.08.2012 00:01:
> >> >> >> On 08/19/2012 05:34 PM, Bart Verwilst wrote:
> >> >> >>> Aug 19 00:10:01 vm02-test kernel: [282120.240067] INFO: task kworker/1:0:3117 blocked for more than 120 seconds.
> >> >> >>> Aug 19 00:10:01 vm02-test kernel: [282120.240182] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> >> >>
> >> >> >> Nothing around Aug 19 00:08:00 ?
> >> >>
> >> >> --
> >> >> Linux-cluster mailing list
> >> >> Linux-cluster@xxxxxxxxxx
> >> >> https://www.redhat.com/mailman/listinfo/linux-cluster
> >> >

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster