On Tue, Jun 11, 2013 at 06:24:24PM -0700, Eric W. Biederman wrote:
> Cliff Wickman <cpw at sgi.com> writes:
>
> > I'm getting a hang when trying to enter a high-memory crash kernel,
> > and I'm at a loss as to how to debug this.
> >
> > This is a 3.10.0-rc3 kernel, and set up as the crash kernel by kexec 2.0.4.
> > The machine is an SGI UV1000.
> >
> > [ 164.027275] SysRq : Trigger a crash
> > [ 164.031136] BUG: unable to handle kernel NULL pointer dereference at (null)
> > [ 164.031136] IP: [<ffffffff81397771>] sysrq_handle_crash+0x11/0x20
> > [ 164.031136] PGD 1fbe835067 PUD 1fbc2e8067 PMD 0
> > [ 164.031136] Oops: 0002 [#1] SMP
> > [ 164.031136] xpc : all partitions have deactivated
> > [ 164.031136] Modules linked in: autofs4 binfmt_misc af_packet rdma_ucm rdma_cm iw_cm ib_addr ib_srp scsi_transport_srp scsi_tgt ib_ipoib ib_cm ib_uverbs ib_umad iw_cxgb3 cxgb3 mdio mlx4_en mlx4_ib ib_sa mlx4_core ib_mthca ib_mad ib_core fuse nls_iso8859_1 nls_cp437 vfat fat loop uv_mmtimer dm_mod sr_mod cdrom usb_storage iTCO_wdt iTCO_vendor_support coretemp mperf kvm_intel ipv6 kvm igb sg crc32c_intel lpc_ich pcspkr mptctl i2c_algo_bit ptp i2c_i801 microcode xhci_hcd joydev ioatdma ehci_pci hid_generic pps_core i2c_core rtc_cmos mfd_core button dca usbhid hid uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif scsi_dh_hp_sw scsi_dh_alua scsi_dh_emc scsi_dh_rdac scsi_dh thermal sata_nv processor piix mptsas mptscsih scsi_transport_sas mptbase megaraid_sas fan thermal_sys hwmon ext3 jbd ata_piix ahci libahci libata scsi_mod
> > [ 164.031136] CPU: 10 PID: 9299 Comm: dopanic Not tainted 3.10.0-rc3-linus-cpw+ #17
> > [ 164.031136] Hardware name: Intel Corp. Stoutland Platform, BIOS 2.16 UEFI2.10 PI1.0 X64 2012-04-27
> > [ 164.031136] task: ffff88203df94440 ti: ffff88203d5c2000 task.ti: ffff88203d5c2000
> > [ 164.031136] RIP: 0010:[<ffffffff81397771>] [<ffffffff81397771>] sysrq_handle_crash+0x11/0x20
> > [ 164.031136] RSP: 0018:ffff88203d5c3e68 EFLAGS: 00010092
> > [ 164.031136] RAX: 000000000000000f RBX: ffffffff81a974e0 RCX: 0000000000000004
> > [ 164.031136] RDX: 0000000000000000 RSI: ffff881fffd0ef48 RDI: 0000000000000063
> > [ 164.031136] RBP: ffff88203d5c3e68 R08: ffff881fffd0d3e8 R09: 000000000004268c
> > [ 164.031136] R10: 0000000000000b8b R11: 0000000000000000 R12: 0000000000000063
> > [ 164.031136] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000296
> > [ 164.031136] FS: 00007ffff7fb5700(0000) GS:ffff881fffd00000(0000) knlGS:0000000000000000
> > [ 164.031136] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 164.031136] CR2: 0000000000000000 CR3: 0000001fbea6c000 CR4: 00000000000007e0
> > [ 164.031136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 164.031136] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [ 164.031136] Stack:
> > [ 164.031136] ffff88203d5c3ea8 ffffffff81398008 01ff88203d5c3e88 0000000000000002
> > [ 164.031136] ffff895f9d478380 ffff88203d5c3f40 00007ffff7ff8000 ffff88203d5c3f40
> > [ 164.031136] ffff88203d5c3ec8 ffffffff813980ad ffff88203d5c3ee8 fffffffffffffffb
> > [ 164.031136] Call Trace:
> > [ 164.031136] [<ffffffff81398008>] __handle_sysrq+0x128/0x190
> > [ 164.031136] [<ffffffff813980ad>] write_sysrq_trigger+0x3d/0x40
> > [ 164.031136] [<ffffffff811c323f>] proc_reg_write+0x4f/0x80
> > [ 164.031136] [<ffffffff8115f107>] vfs_write+0xe7/0x190
> > [ 164.031136] [<ffffffff8115f8ec>] SyS_write+0x5c/0xa0
> > [ 164.031136] [<ffffffff8153c092>] system_call_fastpath+0x16/0x1b
> > [ 164.031136] Code: 00 48 8b 75 e8 48 81 c7 08 08 00 00 e8 09 c6 19 00 31 d2 eb 95 90 90 90 90 90 55 c7 05 f5 74 96 00 01 00 00 00 48 89 e5 0f ae f8 <c6> 04 25 00 00 00 00 01 c9 c3 0f 1f 44 00 00 8d 47 d0 55 83 f8
> > [ 164.031136] RIP [<ffffffff81397771>] sysrq_handle_crash+0x11/0x20
> > [ 164.031136] RSP <ffff88203d5c3e68>
> > [ 164.031136] CR2: 0000000000000000
> >
> > This is always the last output.
> >
> > Can anyone suggest any way to debug this problem?
> >
> > I suppose I can hang the processor just before it executes machine_kexec()
> > and look at it with crash. Any suggestions as to what to look at?

Hi Eric,
Thanks for the reply.

> Hmm. You can enable print statements in purgatory.c. There is a
> command line switch that allows purgatory to print to a serial console.
> That should be a simple easy thing to try.

Do you recall what that switch is? I don't see any condition in purgatory.c.
But the existing printfs don't give me anything.

> I am totally lost as to the status of the patches to make all of this
> work right. But the change to let purgatory work above 4G was merged
> early so hopefully it is not a problem in kexec.

It works on a UV2000 or a whitebox, so it must be close.

>
> You might also want to enable early printk in the crash dump kernel.
> Sometimes kernels get confused on the way up and we hang there.

Yes, I'm using earlyprintk.

-Cliff
-- 
Cliff Wickman
SGI
cpw at sgi.com
(651) 683-3824