My cluster has just crashed yet again, but this time I was able to
capture the full kernel panic:
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
[<0000000000000000>] _stext+0x7ffff000/0x1000
PGD 0
Oops: 0000 [1] SMP
last sysfs file: /kernel/dlm/rgmanager/control
CPU 1
Modules linked in: gfs(U) nfsd exportfs lockd nfs_acl autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2 dlm configfs sunrpc ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 ib_iser rdma_cm ib_cm iw_cm ib_addr ib_local_sa ib_sa ib_mad ib_core iscsi_tcp libiscsi scsi_transport_iscsi dm_multipath video sbs backlight i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport shpchp serio_raw tg3 sg pcspkr dm_snapshot dm_zero dm_mirror dm_mod usb_storage ata_piix libata sd_mod scsi_mod raid1 ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 6215, comm: nfsd Not tainted 2.6.18-53.1.4.el5 #1
RIP: 0010:[<0000000000000000>] [<0000000000000000>] _stext+0x7ffff000/0x1000
RSP: 0018:ffff81006abd56e8 EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffff8100210a4518 RCX: 0000000000000f88
RDX: 0000000000000000 RSI: ffff81000148e2c0 RDI: ffff81007f5757c0
RBP: ffff81000148e2c0 R08: 0400000000000000 R09: 0100000073747261
R10: 000000000c000000 R11: 0c000000c41d0000 R12: 0000000000000f88
R13: 0000000000000f88 R14: ffff81006a170078 R15: 0000000000000000
FS: 00002aaaab0166e0(0000) GS:ffff81007fe357c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
Process nfsd (pid: 6215, threadinfo ffff81006abd4000, task ffff81006afbc7e0)
Stack: ffffffff8000fc3c 0000000000000f88 ffff81006abd5d08 0000000000000000
0000000000000001 ffff81006abd5910 0000000000000001 00000f8800000000
ffff81007f5757c0 ffff810021d855c0 ffff8100210a4518 ffff810021d854b0
Call Trace:
[<ffffffff8000fc3c>] generic_file_buffered_write+0x4cb/0x6d8
[<ffffffff8000ddd9>] current_fs_time+0x3b/0x40
[<ffffffff80015dc6>] __generic_file_aio_write_nolock+0x36c/0x3b8
[<ffffffff885c0a5d>] :gfs:gfs_dreread+0x72/0xc7
[<ffffffff800be014>] generic_file_aio_write_nolock+0x20/0x6c
[<ffffffff800be3e0>] generic_file_write_nolock+0x8f/0xa8
[<ffffffff8009b492>] autoremove_wake_function+0x0/0x2e
[<ffffffff885e7e68>] :gfs:gfs_trans_begin_i+0x13c/0x1b2
[<ffffffff885db3a1>] :gfs:do_write_buf+0x443/0x67e
[<ffffffff885dabb6>] :gfs:walk_vm+0x10e/0x311
[<ffffffff885daf5e>] :gfs:do_write_buf+0x0/0x67e
[<ffffffff8006108d>] wait_for_completion+0x1f/0xa2
[<ffffffff885dae65>] :gfs:__gfs_write+0xac/0xc6
[<ffffffff800d5ee7>] do_readv_writev+0x198/0x295
[<ffffffff885daea8>] :gfs:gfs_write+0x0/0x8
[<ffffffff885dc429>] :gfs:gfs_open+0x12c/0x15e
[<ffffffff8857a77d>] :nfsd:nfsd_vfs_write+0xf2/0x2e1
[<ffffffff885dc2fd>] :gfs:gfs_open+0x0/0x15e
[<ffffffff8001e115>] __dentry_open+0x101/0x1dc
[<ffffffff8857aff1>] :nfsd:nfsd_write+0xb5/0xd5
[<ffffffff88581c96>] :nfsd:nfsd3_proc_write+0xea/0x109
[<ffffffff885771c4>] :nfsd:nfsd_dispatch+0xd7/0x198
[<ffffffff883e1514>] :sunrpc:svc_process+0x44d/0x70b
[<ffffffff800625bf>] __down_read+0x12/0x92
[<ffffffff8857754d>] :nfsd:nfsd+0x0/0x2db
[<ffffffff885776fb>] :nfsd:nfsd+0x1ae/0x2db
[<ffffffff8005bfb1>] child_rip+0xa/0x11
[<ffffffff8857754d>] :nfsd:nfsd+0x0/0x2db
[<ffffffff8857754d>] :nfsd:nfsd+0x0/0x2db
[<ffffffff8005bfa7>] child_rip+0x0/0x11
Code: Bad RIP value.
RIP [<0000000000000000>] _stext+0x7ffff000/0x1000
RSP <ffff81006abd56e8>
CR2: 0000000000000000
<0>Kernel panic - not syncing: Fatal exception
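For anyone who wants to compare this trace with the partial one quoted further down, here is a rough sketch of a script that pulls the frames out of the oops text and counts them per module. The names parse_frames, frames_per_module and SAMPLE are made up for illustration, and the regular expression only assumes the "[<address>] :module:symbol+0xoffset/0xsize" layout seen in the trace above; frames without a ":module:" prefix are counted as "kernel".

```python
import re
from collections import Counter

# Matches frames such as
#   [<ffffffff885c0a5d>] :gfs:gfs_dreread+0x72/0xc7
#   [<ffffffff8000fc3c>] generic_file_buffered_write+0x4cb/0x6d8
# The ":module:" prefix is optional; it is absent for core kernel symbols.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?::(?P<module>\w+):)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frames(oops_text):
    """Return (module, symbol, offset) tuples in the order they appear."""
    frames = []
    for match in FRAME_RE.finditer(oops_text):
        module = match.group("module") or "kernel"
        offset = int(match.group("offset"), 16)
        frames.append((module, match.group("symbol"), offset))
    return frames


def frames_per_module(oops_text):
    """Count how many frames in the trace belong to each module."""
    return Counter(module for module, _, _ in parse_frames(oops_text))


# A few frames copied from the call trace above, used as sample input.
SAMPLE = """\
[<ffffffff8000fc3c>] generic_file_buffered_write+0x4cb/0x6d8
[<ffffffff885c0a5d>] :gfs:gfs_dreread+0x72/0xc7
[<ffffffff885db3a1>] :gfs:do_write_buf+0x443/0x67e
[<ffffffff8857a77d>] :nfsd:nfsd_vfs_write+0xf2/0x2e1
"""

if __name__ == "__main__":
    for module, count in frames_per_module(SAMPLE).most_common():
        print(module, count)
```

Feeding both panics through something like this makes it easier to see that the two crashes pass through the same nfsd and gfs write path, though it says nothing about the root cause.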
I'm experiencing upwards of 8 crashes a day because of this. What can I do
about it?
Thanks,
James
On Wed, 5 Mar 2008, James Chamberlain wrote:
Two of the three nodes in my CS/GFS cluster just crashed, which dissolved
quorum and allowed me to finally capture part of the kernel panic. Here is
what was displayed on the screen:
[<ffffffff885daea8>] :gfs:gfs_write+0x0/0x8
[<ffffffff885cb2a7>] :gfs:gfs_glock_d1+0x15c/0x16c
[<ffffffff885dc429>] :gfs:gfs_open+0x12c/0x15e
[<ffffffff8857a77d>] :nfsd:nfsd_vfs_write+0xf2/0x2e1
[<ffffffff885dc2fd>] :gfs:gfs_open+0x0/0x15e
[<ffffffff8001e115>] __dentry_open+0x101/0x1dc
[<ffffffff8857aff1>] :nfsd:nfsd_write+0xb5/0xd5
[<ffffffff88581c96>] :nfsd:nfsd3_proc_write+0xea/0x109
[<ffffffff885771c4>] :nfsd:nfsd_dispatch+0xd7/0x198
[<ffffffff883e1514>] :sunrpc:svc_process+0x44d/0x70b
[<ffffffff800625bf>] __down_read+0x12/0x92
[<ffffffff8857754d>] :nfsd:nfsd+0x0/0x2db
[<ffffffff885776fb>] :nfsd:nfsd+0x1ae/0x2db
[<ffffffff8005bfb1>] child_rip+0xa/0x11
[<ffffffff8857754d>] :nfsd:nfsd+0x0/0x2db
[<ffffffff8857754d>] :nfsd:nfsd+0x0/0x2db
[<ffffffff8005bfa7>] child_rip+0x0/0x11
Code: Bad RIP value.
RIP [<0000000000000000>] _stext+0x7fff000/0x1000
RSP <ffff81006ac9f6e8>
CR2: 0000000000000000
<0>Kernel panic - not syncing: Fatal exception
Is this enough to figure out what happened, and how can I prevent this from
happening in the future? I suspect that all the instability I have had with
my CS/GFS cluster is related to this sort of crash. I am using the
following on all three nodes:
cman-2.0.73-1.el5_1.1
openais-0.80.3-7.el5
rgmanager-2.0.31-1.el5.centos
lvm2-cluster-2.02.26-1.el5
luci-0.10.0-6.el5.centos.1
ricci-0.10.0-6.el5.centos.1
kernel-2.6.18-53.1.4.el5
gfs-utils-0.1.12-1.el5
kmod-gfs-0.1.19-7.el5_1.1
Thanks,
James
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster