Re: GFS2 crash

Scooter Morris <scooter@xxxxxxxxxxxx> · Wed, 17 Mar 2010 11:41:14 -0700

After removing kmod-gfs2 from all nodes, we ran just fine until last 
night, when we saw the same crash:

[2010-03-17 04:40:01]Wed Mar 17 05:40:01 PDT 2010
[2010-03-17 04:40:01]Unable to handle kernel NULL pointer dereference at 0000000000000078 RIP:
[2010-03-17 04:55:24] [<ffffffff88768383>] :gfs2:revoke_lo_add+0x1a/0x32
[2010-03-17 04:55:24]PGD 0
[2010-03-17 04:55:24]Oops: 0002 [1] SMP
[2010-03-17 04:55:24]last sysfs file: /devices/pci0000:00/0000:00:06.0/0000:0b:00.0/0000:0c:09.0/0000:0d:00.0/host0/rport-0:0-4/target0:0:4/0:0:4:1/state
[2010-03-17 04:55:24]CPU 7
[2010-03-17 04:55:24]Modules linked in: ip_conntrack_netbios_ns xt_state ip_conntrack nfnetlink iptable_filter ip_tables bridge autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2 dlm configfs lockd sunrpc xt_tcpudp ipt_REJECT arpt_mangle arptable_filter arp_tables x_tables ib_iser libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi ib_srp ib_sdp ib_ipoib ipoib_helper ipv6 xfrm_nalgo crypto_api rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa ib_mad ib_core dm_round_robin dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport st ide_cd sg cdrom hpilo pcspkr serio_raw bnx2 dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc ata_piix libata shpchp cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
[2010-03-17 04:55:25]Pid: 792, comm: kswapd0 Not tainted 2.6.18-164.11.1.el5 #1
[2010-03-17 04:55:25]RIP: 0010:[<ffffffff88768383>]  [<ffffffff88768383>] :gfs2:revoke_lo_add+0x1a/0x32
[2010-03-17 04:55:25]RSP: 0018:ffff81082e073ae8  EFLAGS: 00010286
[2010-03-17 04:55:25]RAX: 0000000000000000 RBX: ffff810031d9c2b0 RCX: ffff810041619e40
[2010-03-17 04:55:25]RDX: ffff81063fc3d1b0 RSI: ffff810819749708 RDI: ffff810819749000
[2010-03-17 04:55:25]RBP: ffff81063fc3d190 R08: ffff81082fead486 R09: ffff81082e073b20
[2010-03-17 04:55:25]R10: ffff8101065ae8a0 R11: ffffffff88768369 R12: ffff810819749000
[2010-03-17 04:55:25]R13: 0000000000000000 R14: ffff810031d9c2b0 R15: ffff810819749000
[2010-03-17 04:55:26]FS:  0000000000000000(0000) GS:ffff81082fead340(0000) knlGS:0000000000000000
[2010-03-17 04:55:26]CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[2010-03-17 04:55:26]CR2: 0000000000000078 CR3: 0000000000201000 CR4: 00000000000006e0
[2010-03-17 04:55:26]Process kswapd0 (pid: 792, threadinfo ffff81082e072000, task ffff81082f4ef7e0)
[2010-03-17 04:55:26]Stack:  ffffffff8876983c 000000002e073e10 ffff810031d9c2b0 ffff81010e355078
[2010-03-17 04:55:26] 0000000000000000 0000000000000000 ffffffff8876a9a2 000000000000000e
[2010-03-17 04:55:26] ffff81010e355078 00000000000000b0 ffff81082e073cf0 ffff810819749000
[2010-03-17 04:55:26]Call Trace:
[2010-03-17 04:55:26] [<ffffffff8876983c>] :gfs2:gfs2_remove_from_journal+0x11f/0x131
[2010-03-17 04:55:26] [<ffffffff8876a9a2>] :gfs2:gfs2_invalidatepage+0xea/0x151
[2010-03-17 04:55:26] [<ffffffff8876a5e5>] :gfs2:gfs2_writepage_common+0x95/0xb1
[2010-03-17 04:55:26] [<ffffffff8876ac0f>] :gfs2:gfs2_jdata_writepage+0x56/0x98
[2010-03-17 04:55:26] [<ffffffff800ca21c>] shrink_inactive_list+0x3fd/0x8d8
[2010-03-17 04:55:26] [<ffffffff8004819b>] __pagevec_release+0x19/0x22
[2010-03-17 04:55:26] [<ffffffff800c9cfe>] shrink_active_list+0x4b4/0x4c4
[2010-03-17 04:55:26] [<ffffffff80013007>] shrink_zone+0xf7/0x15d
[2010-03-17 04:55:26] [<ffffffff80057e41>] kswapd+0x323/0x46c
[2010-03-17 04:55:26] [<ffffffff800a00b7>] autoremove_wake_function+0x0/0x2e
[2010-03-17 04:55:27] [<ffffffff8009fe9f>] keventd_create_kthread+0x0/0xc4
[2010-03-17 04:55:27] [<ffffffff80057b1e>] kswapd+0x0/0x46c
[2010-03-17 04:55:27] [<ffffffff8009fe9f>] keventd_create_kthread+0x0/0xc4
[2010-03-17 04:55:27] [<ffffffff80032950>] kthread+0xfe/0x132
[2010-03-17 04:55:27] [<ffffffff8009cd34>] request_module+0x0/0x14d
[2010-03-17 04:55:27] [<ffffffff8005dfb1>] child_rip+0xa/0x11
[2010-03-17 04:55:27] [<ffffffff8009fe9f>] keventd_create_kthread+0x0/0xc4
[2010-03-17 04:55:27] [<ffffffff80032852>] kthread+0x0/0x132
[2010-03-17 04:55:27] [<ffffffff8005dfa7>] child_rip+0x0/0x11
[2010-03-17 04:55:27]
[2010-03-17 04:55:27]
[2010-03-17 04:55:27]Code: ff 40 78 c7 40 50 01 00 00 00 ff 87 dc 06 00 00 48 89 d7 e9
[2010-03-17 04:55:27]RIP  [<ffffffff88768383>] :gfs2:revoke_lo_add+0x1a/0x32
[2010-03-17 04:55:27] RSP<ffff81082e073ae8>
[2010-03-17 04:55:27]CR2: 0000000000000078
[2010-03-17 04:55:27]<0>Kernel panic - not syncing: Fatal exception
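
As a cross-check on the oops above, here is a rough Python sketch that relates the "Oops:" error code, RAX, CR2 and the first bytes of the "Code:" line. It rests on assumptions: the Code: line carries no <..> marker, so the sketch assumes its first byte is the instruction at RIP, and it hand-decodes only the single encoding `ff 40 disp8` (incl disp8(%rax)) rather than acting as a general disassembler. The function names are made up for illustration.

```python
import re

# Excerpt copied from the oops above, with the console timestamps dropped.
OOPS = """\
Oops: 0002 [1] SMP
RAX: 0000000000000000 RBX: ffff810031d9c2b0 RCX: ffff810041619e40
CR2: 0000000000000078 CR3: 0000000000201000 CR4: 00000000000006e0
Code: ff 40 78 c7 40 50 01 00 00 00 ff 87 dc 06 00 00 48 89 d7 e9
"""


def decode_error_code(code):
    """Interpret the three low bits of an x86 page-fault error code."""
    return {
        "page_present": bool(code & 0x1),  # 0 means the page was not present
        "write": bool(code & 0x2),         # 1 means the access was a write
        "user_mode": bool(code & 0x4),     # 0 means the fault came from kernel mode
    }


def first_insn_target(code_bytes, rax):
    """Effective address of `ff 40 disp8`, i.e. incl disp8(%rax).

    Only this one encoding is handled; anything else raises ValueError.
    """
    opcode, modrm, disp8 = code_bytes[:3]
    if opcode != 0xFF or modrm != 0x40:
        raise ValueError("only `ff 40 disp8` is handled in this sketch")
    if disp8 >= 0x80:  # disp8 is a signed 8-bit displacement
        disp8 -= 0x100
    return rax + disp8


def analyse(text):
    """Pull the relevant fields out of the oops text and compare them."""
    err = int(re.search(r"Oops: ([0-9a-f]{4})", text).group(1), 16)
    rax = int(re.search(r"RAX: ([0-9a-f]{16})", text).group(1), 16)
    cr2 = int(re.search(r"CR2: ([0-9a-f]{16})", text).group(1), 16)
    raw = re.search(r"Code: ([0-9a-f ]+)", text).group(1)
    code_bytes = [int(b, 16) for b in raw.split()]
    target = first_insn_target(code_bytes, rax)
    return {
        "error": decode_error_code(err),
        "rax": rax,
        "cr2": cr2,
        "target": target,
        "matches_cr2": target == cr2,
    }


result = analyse(OOPS)
```

Under those assumptions the computed address equals CR2 (0x78) and the error code decodes to a kernel-mode write to a non-present page. That is what you would expect if the pointer held in RAX was NULL when revoke_lo_add tried to increment a field at offset 0x78 from it.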

So, it looks like it wasn't the old kmod-gfs2 :-(

-- scooter

On 03/04/2010 02:25 AM, Steven Whitehouse wrote:
Hi,

On Wed, 2010-03-03 at 21:23 -0800, Scooter Morris wrote:

Hi all,
      Just had a crash on our 3 node RedHat Enterprise Linux 5.4 cluster
that looks a lot like
https://bugzilla.redhat.com/show_bug.cgi?id=520720.  We're running
kernel 2.6.18-164.11.1.el5.  Here is the traceback:

That seems a reasonable conclusion. I assume that you were running with
one or more files with the journaled data flag set?

[snip]

Since we're already running the latest 5.4 kernel, it's not clear what
might be going on, here.  There is a note in the bug about making sure
the gfs2-kmod from 5.2 isn't still around.  What version of gfs2-kmod is
the old version, or should I just remove all instances of gfs2-kmod?

-- scooter

You can remove all versions of the kmod since they are all old. This is
the result of a packaging issue (which we are attempting to solve by
providing an empty kmod in future versions which will override the old
one) but in the meantime, upgrades from 5.2 or earlier require the old
gfs2 kmod to be removed manually.
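
For the manual removal described above, a small Python sketch along these lines could list the packages to look at on each node. It assumes the package names start with "kmod-gfs2" or "gfs2-kmod" (the two spellings used in this thread) and that the input is the output of `rpm -qa`. The function names are hypothetical, and nothing here runs rpm or removes anything; it only prints suggested `rpm -e` commands for review.

```python
# Name prefixes taken from the two spellings used in this thread (an assumption
# about how the packages are named on a given system).
GFS2_KMOD_PREFIXES = ("kmod-gfs2", "gfs2-kmod")


def find_gfs2_kmods(package_names):
    """Return the package names that look like gfs2 kmod packages, sorted."""
    return sorted(
        name for name in package_names if name.startswith(GFS2_KMOD_PREFIXES)
    )


def removal_commands(package_names):
    """Build (but do not run) one `rpm -e` command line per matching package."""
    return ["rpm -e " + name for name in find_gfs2_kmods(package_names)]


def parse_rpm_qa(output):
    """Split the text printed by `rpm -qa` into one package name per entry."""
    return [line.strip() for line in output.splitlines() if line.strip()]
```

One way to use it: save the output of `rpm -qa` on a node to a file, pass the file's contents through parse_rpm_qa(), and print the result of removal_commands(). An empty list means no matching package was found on that node.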

I don't see any sign of the kmod in the stack trace you sent though, so
I suspect it's not an issue in this case. Certainly worth checking though
to be certain.

Steve.

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
