Re: Kernel problem on rx2800 i2

I'm running out of luck trying to understand the issue with the
zone list.  Adding the ia64 mailing list in addition to Tony
to see if someone can figure out how an alloc_pages_node call for
the node stored in an AHCI PCIe pci_dev could cause an oops in the
zonelist lookup.
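
For anyone following along, here is a toy userspace model of one way
such a lookup could fault.  It is only a sketch under assumptions, not
kernel code and not a confirmed diagnosis: the structure layout, the
0x1680 padding, the node_data[] table and the helper names
zonelist_addr() and lookup_is_safe() are all made up for illustration.
The idea is that if the per-node data pointer for the device's node
were NULL, taking the address of a field inside it would give a small
non-zero address, in the same range as the 0000000000001688 reported in
the 4.15-rc7 oops quoted below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NUMNODES  4
#define MAX_ZONELISTS 2

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct zonelist {
	void *zones[8];
};

struct pglist_data {
	char other_fields[0x1680];	/* padding size chosen for illustration only */
	struct zonelist node_zonelists[MAX_ZONELISTS];
};

static struct pglist_data node0;

/* Per-node table; only node 0 has data, every other entry stays NULL. */
static struct pglist_data *node_data[MAX_NUMNODES] = { &node0 };

/*
 * Address a zonelist lookup would compute for (nid, idx): base pointer
 * plus field offset.  Nothing is dereferenced here, so the model can
 * show the resulting address without crashing.
 */
static uintptr_t zonelist_addr(int nid, int idx)
{
	assert(nid >= 0 && nid < MAX_NUMNODES);
	assert(idx >= 0 && idx < MAX_ZONELISTS);

	return (uintptr_t)node_data[nid]
		+ offsetof(struct pglist_data, node_zonelists)
		+ (size_t)idx * sizeof(struct zonelist);
}

/* The kind of check a caller would need before trusting a device's node id. */
static int lookup_is_safe(int nid)
{
	return nid >= 0 && nid < MAX_NUMNODES && node_data[nid] != NULL;
}
```

In this model zonelist_addr(1, 0) evaluates to 0x1680 because node 1
has no data, and lookup_is_safe(1) returns 0.  Whether the real oops
comes from a node id without valid per-node data is exactly the open
question.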

On Fri, Jun 21, 2019 at 10:08:06PM +0200, Frank Scheiner wrote:
> Hi there,
>
> recent testing of a Debian v4.19.37 kernel showed a problem on my rx2800
> i2 happening during kernel boot:
>
> ```
> [    0.000000] Linux version 4.19.0-5-itanium
> (debian-kernel@xxxxxxxxxxxxxxxx) (gcc version 8.3.0 (Debian
> 8.3.0-10~ia64.1)) #1 SMP Debian 4.19.37-3 (2019-05-18)
> [    0.000000] EFI v2.10 by HP:
> [    0.000000] efi:  SALsystab=0x6fdd63a18  ACPI 2.0=0x3d3c4014
> HCDP=0x6ffff8798  SMBIOS=0x3d368000
> [    0.000000] booting generic kernel on platform dig
> [    0.000000] PCDP: v3 at 0x6ffff8798
> [    0.000000] earlycon: uart8250 at I/O port 0x4000 (options '115200n8')
> [    0.000000] bootconsole [uart8250] enabled
> [    0.000000] ACPI: Early table checksum verification disabled
> [    0.000000] ACPI: RSDP 0x000000003D3C4014 000024 (v02 HP    )
> [    0.000000] ACPI: XSDT 0x000000003D3C4580 000124 (v01 HP     RX2800-2
> 00000001      01000013)
> [...]
> [   13.993718] Unpacking initramfs...
> [...]
> [   22.655630] Run /init as init process
> [   22.818930] SCSI subsystem initialized
> [   22.844653] ACPI: bus type USB registered
> [   22.878940] HP HPSA Driver (v 3.4.20-125)
> [   22.930628] usbcore: registered new interface driver usbfs
> [   23.072034] usbcore: registered new interface driver hub
> [   23.072925] hpsa 0000:01:00.0: Logical aborts not supported
> [   23.150942] usbcore: registered new device driver usb
> [   23.231690] hpsa 0000:01:00.0: HP SSD Smart Path aborts not supported
> [   23.306942] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
> [   23.417101] systemd-udevd[115]: NaT consumption 2216203124768 [1]
> [   23.488663] ehci-pci: EHCI PCI platform driver
> [   23.490942] uhci_hcd: USB Universal Host Controller Interface driver
> [   23.420927] Modules linked in: uhci_hcd(+) ehci_pci(+) ehci_hcd
> hpsa(+) scsi_transport_sas usbcore scsi_mod usb_common
> [   23.420927]
> [   23.420927] CPU: 6 PID: 115 Comm: systemd-udevd Not tainted
> 4.19.0-5-itanium #1 Debian 4.19.37-3
> [   23.420927] Hardware name: hp Integrity rx2800 i2, BIOS 01.93 09/12/2012
> [   23.420927] psr : 0000121008026010 ifs : 8000000000002046 ip  :
> [<a0000001002af041>]    Not tainted (4.19.0-5-itanium Debian 4.19.37-3)
> [   23.420927] ip is at __alloc_pages_nodemask+0x261/0x20c0
> [   23.420927] unat: 0000000000000000 pfs : 0000000000000793 rsc :
> 0000000000000003
> [   23.420927] rnat: 0000000000000000 bsps: 0000000000000000 pr  :
> 85aaa9a99a6a6659
> [   23.420927] ldrs: 0000000000000000 ccv : 0000000000000000 fpsr:
> 0009804c8a70433f
> [   23.420927] csd : 0000000000000000 ssd : 0000000000000000
> [   23.420927] b0  : a0000001001710e0 b6  : a0000001003948c0 b7  :
> a0000001000469c0
> [   23.420927] f6  : 10012bffff00000000000 f7  : 1003e00000000000bffff
> [   23.420927] f8  : 1003e0000000000003fc0 f9  : 1003effffffffffffffab
> [   23.420927] f10 : 10016818d087e7cd81a78 f11 : 1003e000000000000002a
> [   23.420927] r1  : a0000001015d6ba0 r2  : a0000001013643c8 r3  :
> fffffffffffc04b8
> [   23.420927] r8  : 0000000000001440 r9  : e000000001507708 r10 :
> 0000000000000008
> [   23.420927] r11 : ffffffffffd8d818 r12 : e000000682fcfbd0 r13 :
> e000000682fc8000
> [   23.420927] r14 : a0000001013643b8 r15 : ffffffffffd8d828 r16 :
> 00000000007fffff
> [   23.420927] r17 : 0000000000000008 r18 : 0000000000000000 r19 :
> e000000001507710
> [   23.420927] r20 : 0000000000000000 r21 : 0000000000002500 r22 :
> 0000000000000000
> [   23.420927] r23 : 0000000000000000 r24 : 0000000000000000 r25 :
> 0000000000000000
> [   23.420927] r26 : 0000000000000000 r27 : 0000000000000000 r28 :
> e000000682fc87b0
> [   23.420927] r29 : 0000000000200000 r30 : 0000000000000000 r31 :
> 0000000000000000
> [   23.420927]
> [   23.420927] Call Trace:
> [   23.420927]  [<a000000100014bd0>] show_stack+0x90/0xc0
> [   23.420927]                                 sp=e000000682fcf790
> bsp=e000000682fc9c80
> [   23.420927]  [<a0000001000152d0>] show_regs+0x6d0/0xa00
> [   23.420927]                                 sp=e000000682fcf960
> bsp=e000000682fc9c10
> [   23.420927]  [<a000000100029330>] die+0x1b0/0x460
> [   23.420927]                                 sp=e000000682fcf980
> bsp=e000000682fc9bc8
> [   23.420927]  [<a000000100e75100>] ia64_fault+0x5a0/0xf60
> [   23.420927]                                 sp=e000000682fcf980
> bsp=e000000682fc9b70
> [   23.420927]  [<a00000010000c9c0>] ia64_leave_kernel+0x0/0x270
> [   23.420927]                                 sp=e000000682fcfa00
> bsp=e000000682fc9b70
> [   23.420927]  [<a0000001002af040>] __alloc_pages_nodemask+0x260/0x20c0
> [   23.420927]                                 sp=e000000682fcfbd0
> bsp=e000000682fc9938
> [   23.420927]  [<a0000001001710e0>] dma_direct_alloc+0x140/0x2e0
> [   23.420927]                                 sp=e000000682fcfc40
> bsp=e000000682fc98c0
> [   23.420927]  [<a000000100173910>] swiotlb_alloc+0x50/0x2e0
> [   23.420927]                                 sp=e000000682fcfc40
> bsp=e000000682fc9868
> ```
>
> The machine doesn't continue booting afterwards. It boots fine with a
> 4.14.x kernel with Gentoo patches, but no later kernel version with
> Gentoo patches works on it. With some testing I could narrow down the
> Linux versions between which the problematic change was introduced
> to 4.15.x and 4.16.x. Bisecting between tag v4.15.18 (good)
> and tag v4.16-rc1 (bad) pointed to commit
> 543cea9accd9804307541cb93d3ed7ec94b07237 ([1]) as the first bad commit.
>
> [1]:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=543cea9accd9804307541cb93d3ed7ec94b07237
>
> The kernel messages from the problematic kernels built during bisecting
> look different from those of the Debian v4.19.37 kernel shown above, though:
>
> ```
> Linux version 4.15.0-rc7-00047-g543cea9accd9-dirty (root@rx2800-i2) (gcc
> version 7.3.0 (Gentoo 7.3.0-r3 p1.4)) #1 SMP Thu Jun 13 22:16:30 CEST 2019
> EFI v2.10 by HP:
> efi:  SALsystab=0xdfdd63a18  ACPI 2.0=0x3d3c4014  HCDP=0xdffff8798
> SMBIOS=0x3d368000
> booting generic kernel on platform dig
> PCDP: v3 at 0xdffff8798
> earlycon: uart8250 at I/O port 0x4000 (options '115200n8')
> bootconsole [uart8250] enabled
> ACPI: Early table checksum verification disabled
> ACPI: RSDP 0x000000003D3C4014 000024 (v02 HP    )
> ACPI: XSDT 0x000000003D3C4580 000124 (v01 HP     RX2800-2 00000001
> 01000013)
> [...]
> Trying to unpack rootfs image as initramfs...
> [...]
> Loading Adaptec I2O RAID: Version 2.4 Build 5go
> Detecting Adaptec I2O RAID controllers...
> ahci 0000:00:1f.2: AHCI 0001.0200 32 slots 6 ports 3 Gbps 0x3f impl SATA
> mode
> ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part ccc ems
> Unable to handle kernel NULL pointer dereference (address 0000000000001688)
> swapper/0[1]: Oops 11012296146944 [1]
> Modules linked in:
>
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted
> 4.15.0-rc7-00047-g543cea9accd9-dirty #1
> Hardware name: hp Integrity rx2800 i2, BIOS 01.93 09/12/2012
> psr : 00001210084a6010 ifs : 8000000000001734 ip  : [<a000000100180401>]
>    Not tainted (4.15.0-rc7-00047-g543cea9accd9-dirty)
> ip is at __alloc_pages_nodemask+0x1a1/0x1670
> unat: 0000000000000000 pfs : 0000000000001734 rsc : 0000000000000003
> rnat: 000000038c5ad78d bsps: 000000000001003e pr  : 565595a66aa65799
> ldrs: 0000000000000000 ccv : 000000032e40a799 fpsr: 0009804c8a70433f
> csd : 0000000000000000 ssd : 0000000000000000
> b0  : a0000001001802c0 b6  : a000000100050b50 b7  : a0000001007e83d0
> f6  : 1003e0000000000000000 f7  : 1003e00000000000164ff
> f8  : 1003e0000000000000f00 f9  : 1003e000000000000000f
> f10 : 1003e0000000000000400 f11 : 1003e0000000000003c00
> r1  : a00000010155edc0 r2  : a0000001012b5e90 r3  : 0000000001ffffff
> r8  : 0000000000001680 r9  : 0000000000250015 r10 : e000000001519980
> r11 : e000000001519988 r12 : e000000d8334fcf0 r13 : e000000d83348000
> r14 : ffffffffffd570d0 r15 : 0000000000000008 r16 : e000000001519990
> r17 : 0000000000000000 r18 : 0000000000001680 r19 : 0000000000000000
> r20 : 0000000000000000 r21 : 0000000000000000 r22 : 0000000000000000
> r23 : 0000000000000000 r24 : ffffffffffd570c0 r25 : a0000001012b5e80
> r26 : 0000000000000000 r27 : 0000000000000000 r28 : 0000000000001688
> r29 : 0000000000000358 r30 : 0000000000000000 r31 : 0000000000000081
>
> Call Trace:
>  [<a000000100013760>] show_stack+0x40/0x90
>                                 sp=e000000d8334f8c0 bsp=e000000d83349890
>  [<a0000001000140e0>] show_regs+0x930/0x940
>                                 sp=e000000d8334fa90 bsp=e000000d83349820
>  [<a00000010003a7d0>] die+0x1a0/0x2f0
>                                 sp=e000000d8334fa90 bsp=e000000d833497d8
>  [<a000000100063140>] ia64_do_page_fault+0x830/0xa30
>                                 sp=e000000d8334fa90 bsp=e000000d83349740
>  [<a00000010000c400>] ia64_leave_kernel+0x0/0x270
>                                 sp=e000000d8334fb20 bsp=e000000d83349740
>  [<a000000100180400>] __alloc_pages_nodemask+0x1a0/0x1670
>                                 sp=e000000d8334fcf0 bsp=e000000d83349598
>  [<a000000100d70100>] dma_direct_alloc+0x170/0x470
>                                 sp=e000000d8334fd50 bsp=e000000d83349518
>  [<a0000001006a8770>] swiotlb_alloc+0x50/0x90
>                                 sp=e000000d8334fd50 bsp=e000000d833494d8
>  [<a00000010083abd0>] dmam_alloc_coherent+0x250/0x2c0
>                                 sp=e000000d8334fd50 bsp=e000000d83349488
>  [<a0000001009990c0>] ahci_port_start+0x2f0/0x4b0
>                                 sp=e000000d8334fd50 bsp=e000000d83349440
>  [<a000000100958490>] ata_host_start+0x310/0x470
>                                 sp=e000000d8334fd60 bsp=e000000d833493d0
>  [<a000000100964a70>] ata_host_activate+0x20/0x290
>                                 sp=e000000d8334fd60 bsp=e000000d83349370
>  [<a000000100999570>] ahci_host_activate+0x2f0/0x300
>                                 sp=e000000d8334fd60 bsp=e000000d83349300
>  [<a0000001009923d0>] ahci_init_one+0x1580/0x20b0
>                                 sp=e000000d8334fd60 bsp=e000000d83349258
>  [<a0000001006d0610>] local_pci_probe+0x90/0x150
>                                 sp=e000000d8334fdc0 bsp=e000000d83349218
>  [<a0000001006d1a30>] pci_device_probe+0x2f0/0x310
>                                 sp=e000000d8334fdc0 bsp=e000000d833491d8
>  [<a0000001008229f0>] driver_probe_device+0x520/0x720
>                                 sp=e000000d8334fde0 bsp=e000000d83349170
>  [<a000000100822d10>] __driver_attach+0x120/0x190
>                                 sp=e000000d8334fde0 bsp=e000000d83349140
>  [<a00000010081ec00>] bus_for_each_dev+0x120/0x140
>                                 sp=e000000d8334fde0 bsp=e000000d83349100
>  [<a000000100821bf0>] driver_attach+0x40/0x60
>                                 sp=e000000d8334fdf0 bsp=e000000d833490e0
>  [<a0000001008211b0>] bus_add_driver+0x400/0x4a0
>                                 sp=e000000d8334fdf0 bsp=e000000d83349090
>  [<a000000100823fc0>] driver_register+0x240/0x2d0
>                                 sp=e000000d8334fdf0 bsp=e000000d83349068
>  [<a0000001006cfde0>] __pci_register_driver+0xa0/0xc0
>                                 sp=e000000d8334fdf0 bsp=e000000d83349038
>  [<a0000001010ecdb0>] ahci_pci_driver_init+0x50/0x70
>                                 sp=e000000d8334fdf0 bsp=e000000d83349020
>  [<a00000010000a950>] do_one_initcall+0x290/0x2a0
>                                 sp=e000000d8334fdf0 bsp=e000000d83348fe0
>  [<a0000001010a1c10>] kernel_init_freeable+0x400/0x430
>                                 sp=e000000d8334fe30 bsp=e000000d83348f78
>  [<a000000100d93860>] kernel_init+0x20/0x280
>                                 sp=e000000d8334fe30 bsp=e000000d83348f58
>  [<a00000010000c1f0>] call_payload+0x50/0x80
>                                 sp=e000000d8334fe30 bsp=e000000d83348f40
> Disabling lock debugging due to kernel taint
> Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
>
> ---[ end Kernel panic - not syncing: Attempted to kill init!
> exitcode=0x0000000b
> ```
>
> ...but because of the result below (spoiler: a v4.19.37 kernel working
> on my rx2800 i2), I assume they're caused by the very same issue.
>
> Starting at tag v4.19.37 I then reverted the following commits:
>
> * cf65a0f6f6ff7631ba0ac0513a14ca5b65320d80 [2]
>
> * 16e73adbca76fd18733278cb688b0ddb4cad162c [3]
>
> * 9d37c094dacda531ac3e529dd4dd139e3c0b7811 [4]
>
> * 4fac8076df854aa4ddb8acbf6cce9d337300219e [5]
>
> * 543cea9accd9804307541cb93d3ed7ec94b07237 [6]
>
> ...and compiled a kernel using the localmodconfig target to create a
> minimal config. The resulting kernel booted fine on my rx2800 i2:
>
> ```
> Linux version 4.19.37-00005-g55bd603c2590-dirty (root@rx2800-i2) (gcc
> version 7.3.0 (Gentoo 7.3.0-r3 p1.4)) #1 SMP Thu Jun 20 23:58:57 CEST 2019
> EFI v2.10 by HP:
> efi:  SALsystab=0xdfdd63a18  ACPI 2.0=0x3d3c4014  HCDP=0xdffff8798
> SMBIOS=0x3d368000
> booting generic kernel on platform dig
> PCDP: v3 at 0xdffff8798
> earlycon: uart8250 at I/O port 0x4000 (options '115200n8')
> bootconsole [uart8250] enabled
> ACPI: Early table checksum verification disabled
> ACPI: RSDP 0x000000003D3C4014 000024 (v02 HP    )
> ACPI: XSDT 0x000000003D3C4580 000124 (v01 HP     RX2800-2 00000001
> 01000013)
> [...]
>  * Starting sshd ...
>  [ ok ]
>  * Starting local ...
>  [ ok ]
>
>
> This is rx2800-i2[...] (Linux ia64 4.19.37-00005-g55bd603c2590-dirty)
> 20:49:42
> ```
>
> [2]:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=cf65a0f6f6ff7631ba0ac0513a14ca5b65320d80
>
> [3]:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=16e73adbca76fd18733278cb688b0ddb4cad162c
>
> [4]:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=9d37c094dacda531ac3e529dd4dd139e3c0b7811
>
> [5]:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=4fac8076df854aa4ddb8acbf6cce9d337300219e
>
> [6]:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=543cea9accd9804307541cb93d3ed7ec94b07237
>
> ****
>
> Please note:
>
> * that I'm always using the "ia64: fix ptrace" patch ([7]) in addition,
> as I'm compiling with gcc 7.3.0 on Gentoo;
>
> [7]: https://lore.kernel.org/patchwork/patch/884685/
>
> * that the original problem only shows on my rx2800 i2 and not on my
> other ia64 gear (rx4640 with Madison, rx2620 with Montecito and rx2660
> with Montvale), so could be related to the different system architecture
> of the Tukwila based rx2800 i2 (UMA => NUMA IIC);
>
> I just now tried to compile a more recent v5.2-rc5 kernel with the above
> commits reverted, but that fails. There seem to have been further
> changes since v4.19.37 for which I would still need to find the
> respective commits to revert. But I assume this work is not needed
> for further examination of the problem, so I'm not pursuing it for
> now. If it is needed, please let me know.
>
> James Clarke already had an idea what could be involved in this issue.
> Maybe he can give his assessment.
>
> If you want me to try a patch for a specific Linux version, please let
> me know. The same if you need further information from me.
>
> Cheers
> Frank
---end quoted text---


