I'm running out of ideas trying to understand the issue with the zone
list. Adding the ia64 mailing list in addition to Tony to see if someone
can figure out how an alloc_pages_node call for the node stored in the
pci_dev of an AHCI PCIe device could cause an oops in the zonelist
lookup.

On Fri, Jun 21, 2019 at 10:08:06PM +0200, Frank Scheiner wrote:
> Hi there,
>
> recent testing of a Debian v4.19.37 kernel showed a problem on my rx2800
> i2 happening during kernel boot:
>
> ```
> [ 0.000000] Linux version 4.19.0-5-itanium
> (debian-kernel@xxxxxxxxxxxxxxxx) (gcc version 8.3.0 (Debian
> 8.3.0-10~ia64.1)) #1 SMP Debian 4.19.37-3 (2019-05-18)
> [ 0.000000] EFI v2.10 by HP:
> [ 0.000000] efi: SALsystab=0x6fdd63a18 ACPI 2.0=0x3d3c4014
> HCDP=0x6ffff8798 SMBIOS=0x3d368000
> [ 0.000000] booting generic kernel on platform dig
> [ 0.000000] PCDP: v3 at 0x6ffff8798
> [ 0.000000] earlycon: uart8250 at I/O port 0x4000 (options '115200n8')
> [ 0.000000] bootconsole [uart8250] enabled
> [ 0.000000] ACPI: Early table checksum verification disabled
> [ 0.000000] ACPI: RSDP 0x000000003D3C4014 000024 (v02 HP )
> [ 0.000000] ACPI: XSDT 0x000000003D3C4580 000124 (v01 HP RX2800-2
> 00000001 01000013)
> [...]
> [ 13.993718] Unpacking initramfs...
> [...]
> [ 22.655630] Run /init as init process
> [ 22.818930] SCSI subsystem initialized
> [ 22.844653] ACPI: bus type USB registered
> [ 22.878940] HP HPSA Driver (v 3.4.20-125)
> [ 22.930628] usbcore: registered new interface driver usbfs
> [ 23.072034] usbcore: registered new interface driver hub
> [ 23.072925] hpsa 0000:01:00.0: Logical aborts not supported
> [ 23.150942] usbcore: registered new device driver usb
> [ 23.231690] hpsa 0000:01:00.0: HP SSD Smart Path aborts not supported
> [ 23.306942] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
> [ 23.417101] systemd-udevd[115]: NaT consumption 2216203124768 [1]
> [ 23.488663] ehci-pci: EHCI PCI platform driver
> [ 23.490942] uhci_hcd: USB Universal Host Controller Interface driver
> [ 23.420927] Modules linked in: uhci_hcd(+) ehci_pci(+) ehci_hcd
> hpsa(+) scsi_transport_sas usbcore scsi_mod usb_common
> [ 23.420927]
> [ 23.420927] CPU: 6 PID: 115 Comm: systemd-udevd Not tainted
> 4.19.0-5-itanium #1 Debian 4.19.37-3
> [ 23.420927] Hardware name: hp Integrity rx2800 i2, BIOS 01.93 09/12/2012
> [ 23.420927] psr : 0000121008026010 ifs : 8000000000002046 ip :
> [<a0000001002af041>] Not tainted (4.19.0-5-itanium Debian 4.19.37-3)
> [ 23.420927] ip is at __alloc_pages_nodemask+0x261/0x20c0
> [ 23.420927] unat: 0000000000000000 pfs : 0000000000000793 rsc :
> 0000000000000003
> [ 23.420927] rnat: 0000000000000000 bsps: 0000000000000000 pr :
> 85aaa9a99a6a6659
> [ 23.420927] ldrs: 0000000000000000 ccv : 0000000000000000 fpsr:
> 0009804c8a70433f
> [ 23.420927] csd : 0000000000000000 ssd : 0000000000000000
> [ 23.420927] b0 : a0000001001710e0 b6 : a0000001003948c0 b7 :
> a0000001000469c0
> [ 23.420927] f6 : 10012bffff00000000000 f7 : 1003e00000000000bffff
> [ 23.420927] f8 : 1003e0000000000003fc0 f9 : 1003effffffffffffffab
> [ 23.420927] f10 : 10016818d087e7cd81a78 f11 : 1003e000000000000002a
> [ 23.420927] r1 : a0000001015d6ba0 r2 : a0000001013643c8 r3 :
> fffffffffffc04b8
> [ 23.420927] r8 : 0000000000001440 r9 : e000000001507708 r10 :
> 0000000000000008
> [ 23.420927] r11 : ffffffffffd8d818 r12 : e000000682fcfbd0 r13 :
> e000000682fc8000
> [ 23.420927] r14 : a0000001013643b8 r15 : ffffffffffd8d828 r16 :
> 00000000007fffff
> [ 23.420927] r17 : 0000000000000008 r18 : 0000000000000000 r19 :
> e000000001507710
> [ 23.420927] r20 : 0000000000000000 r21 : 0000000000002500 r22 :
> 0000000000000000
> [ 23.420927] r23 : 0000000000000000 r24 : 0000000000000000 r25 :
> 0000000000000000
> [ 23.420927] r26 : 0000000000000000 r27 : 0000000000000000 r28 :
> e000000682fc87b0
> [ 23.420927] r29 : 0000000000200000 r30 : 0000000000000000 r31 :
> 0000000000000000
> [ 23.420927]
> [ 23.420927] Call Trace:
> [ 23.420927] [<a000000100014bd0>] show_stack+0x90/0xc0
> [ 23.420927] sp=e000000682fcf790
> bsp=e000000682fc9c80
> [ 23.420927] [<a0000001000152d0>] show_regs+0x6d0/0xa00
> [ 23.420927] sp=e000000682fcf960
> bsp=e000000682fc9c10
> [ 23.420927] [<a000000100029330>] die+0x1b0/0x460
> [ 23.420927] sp=e000000682fcf980
> bsp=e000000682fc9bc8
> [ 23.420927] [<a000000100e75100>] ia64_fault+0x5a0/0xf60
> [ 23.420927] sp=e000000682fcf980
> bsp=e000000682fc9b70
> [ 23.420927] [<a00000010000c9c0>] ia64_leave_kernel+0x0/0x270
> [ 23.420927] sp=e000000682fcfa00
> bsp=e000000682fc9b70
> [ 23.420927] [<a0000001002af040>] __alloc_pages_nodemask+0x260/0x20c0
> [ 23.420927] sp=e000000682fcfbd0
> bsp=e000000682fc9938
> [ 23.420927] [<a0000001001710e0>] dma_direct_alloc+0x140/0x2e0
> [ 23.420927] sp=e000000682fcfc40
> bsp=e000000682fc98c0
> [ 23.420927] [<a000000100173910>] swiotlb_alloc+0x50/0x2e0
> [ 23.420927] sp=e000000682fcfc40
> bsp=e000000682fc9868
> ```
>
> The machine doesn't continue boot afterwards. The machine boots fine
> with a 4.14.x with Gentoo patches but also no later minor kernel version
> with Gentoo patches works on it. With some testing I could limit the
> Linux versions, between which the problematic change could have been
> introduced, to 4.15.x and 4.16.x. Bisecting between tag v4.15.18 (good)
> and tag v4.16-rc1 (bad) pointed to commit
> 543cea9accd9804307541cb93d3ed7ec94b07237 ([1]) as first bad commit.
>
> [1]:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=543cea9accd9804307541cb93d3ed7ec94b07237
>
> The kernel messages with problematic kernels from the bisecting process
> look different to the ones from the above shown v4.19.37 from Debian though:
>
> ```
> Linux version 4.15.0-rc7-00047-g543cea9accd9-dirty (root@rx2800-i2) (gcc
> version 7.3.0 (Gentoo 7.3.0-r3 p1.4)) #1 SMP Thu Jun 13 22:16:30 CEST 2019
> EFI v2.10 by HP:
> efi: SALsystab=0xdfdd63a18 ACPI 2.0=0x3d3c4014 HCDP=0xdffff8798
> SMBIOS=0x3d368000
> booting generic kernel on platform dig
> PCDP: v3 at 0xdffff8798
> earlycon: uart8250 at I/O port 0x4000 (options '115200n8')
> bootconsole [uart8250] enabled
> ACPI: Early table checksum verification disabled
> ACPI: RSDP 0x000000003D3C4014 000024 (v02 HP )
> ACPI: XSDT 0x000000003D3C4580 000124 (v01 HP RX2800-2 00000001
> 01000013)
> [...]
> Trying to unpack rootfs image as initramfs...
> [...]
> Loading Adaptec I2O RAID: Version 2.4 Build 5go
> Detecting Adaptec I2O RAID controllers...
> ahci 0000:00:1f.2: AHCI 0001.0200 32 slots 6 ports 3 Gbps 0x3f impl SATA
> mode
> ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part ccc ems
> Unable to handle kernel NULL pointer dereference (address 0000000000001688)
> swapper/0[1]: Oops 11012296146944 [1]
> Modules linked in:
>
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted
> 4.15.0-rc7-00047-g543cea9accd9-dirty #1
> Hardware name: hp Integrity rx2800 i2, BIOS 01.93 09/12/2012
> psr : 00001210084a6010 ifs : 8000000000001734 ip : [<a000000100180401>]
> Not tainted (4.15.0-rc7-00047-g543cea9accd9-dirty)
> ip is at __alloc_pages_nodemask+0x1a1/0x1670
> unat: 0000000000000000 pfs : 0000000000001734 rsc : 0000000000000003
> rnat: 000000038c5ad78d bsps: 000000000001003e pr : 565595a66aa65799
> ldrs: 0000000000000000 ccv : 000000032e40a799 fpsr: 0009804c8a70433f
> csd : 0000000000000000 ssd : 0000000000000000
> b0 : a0000001001802c0 b6 : a000000100050b50 b7 : a0000001007e83d0
> f6 : 1003e0000000000000000 f7 : 1003e00000000000164ff
> f8 : 1003e0000000000000f00 f9 : 1003e000000000000000f
> f10 : 1003e0000000000000400 f11 : 1003e0000000000003c00
> r1 : a00000010155edc0 r2 : a0000001012b5e90 r3 : 0000000001ffffff
> r8 : 0000000000001680 r9 : 0000000000250015 r10 : e000000001519980
> r11 : e000000001519988 r12 : e000000d8334fcf0 r13 : e000000d83348000
> r14 : ffffffffffd570d0 r15 : 0000000000000008 r16 : e000000001519990
> r17 : 0000000000000000 r18 : 0000000000001680 r19 : 0000000000000000
> r20 : 0000000000000000 r21 : 0000000000000000 r22 : 0000000000000000
> r23 : 0000000000000000 r24 : ffffffffffd570c0 r25 : a0000001012b5e80
> r26 : 0000000000000000 r27 : 0000000000000000 r28 : 0000000000001688
> r29 : 0000000000000358 r30 : 0000000000000000 r31 : 0000000000000081
>
> Call Trace:
> [<a000000100013760>] show_stack+0x40/0x90
> sp=e000000d8334f8c0 bsp=e000000d83349890
> [<a0000001000140e0>] show_regs+0x930/0x940
> sp=e000000d8334fa90 bsp=e000000d83349820
> [<a00000010003a7d0>] die+0x1a0/0x2f0
> sp=e000000d8334fa90 bsp=e000000d833497d8
> [<a000000100063140>] ia64_do_page_fault+0x830/0xa30
> sp=e000000d8334fa90 bsp=e000000d83349740
> [<a00000010000c400>] ia64_leave_kernel+0x0/0x270
> sp=e000000d8334fb20 bsp=e000000d83349740
> [<a000000100180400>] __alloc_pages_nodemask+0x1a0/0x1670
> sp=e000000d8334fcf0 bsp=e000000d83349598
> [<a000000100d70100>] dma_direct_alloc+0x170/0x470
> sp=e000000d8334fd50 bsp=e000000d83349518
> [<a0000001006a8770>] swiotlb_alloc+0x50/0x90
> sp=e000000d8334fd50 bsp=e000000d833494d8
> [<a00000010083abd0>] dmam_alloc_coherent+0x250/0x2c0
> sp=e000000d8334fd50 bsp=e000000d83349488
> [<a0000001009990c0>] ahci_port_start+0x2f0/0x4b0
> sp=e000000d8334fd50 bsp=e000000d83349440
> [<a000000100958490>] ata_host_start+0x310/0x470
> sp=e000000d8334fd60 bsp=e000000d833493d0
> [<a000000100964a70>] ata_host_activate+0x20/0x290
> sp=e000000d8334fd60 bsp=e000000d83349370
> [<a000000100999570>] ahci_host_activate+0x2f0/0x300
> sp=e000000d8334fd60 bsp=e000000d83349300
> [<a0000001009923d0>] ahci_init_one+0x1580/0x20b0
> sp=e000000d8334fd60 bsp=e000000d83349258
> [<a0000001006d0610>] local_pci_probe+0x90/0x150
> sp=e000000d8334fdc0 bsp=e000000d83349218
> [<a0000001006d1a30>] pci_device_probe+0x2f0/0x310
> sp=e000000d8334fdc0 bsp=e000000d833491d8
> [<a0000001008229f0>] driver_probe_device+0x520/0x720
> sp=e000000d8334fde0 bsp=e000000d83349170
> [<a000000100822d10>] __driver_attach+0x120/0x190
> sp=e000000d8334fde0 bsp=e000000d83349140
> [<a00000010081ec00>] bus_for_each_dev+0x120/0x140
> sp=e000000d8334fde0 bsp=e000000d83349100
> [<a000000100821bf0>] driver_attach+0x40/0x60
> sp=e000000d8334fdf0 bsp=e000000d833490e0
> [<a0000001008211b0>] bus_add_driver+0x400/0x4a0
> sp=e000000d8334fdf0 bsp=e000000d83349090
> [<a000000100823fc0>] driver_register+0x240/0x2d0
> sp=e000000d8334fdf0 bsp=e000000d83349068
> [<a0000001006cfde0>] __pci_register_driver+0xa0/0xc0
> sp=e000000d8334fdf0 bsp=e000000d83349038
> [<a0000001010ecdb0>] ahci_pci_driver_init+0x50/0x70
> sp=e000000d8334fdf0 bsp=e000000d83349020
> [<a00000010000a950>] do_one_initcall+0x290/0x2a0
> sp=e000000d8334fdf0 bsp=e000000d83348fe0
> [<a0000001010a1c10>] kernel_init_freeable+0x400/0x430
> sp=e000000d8334fe30 bsp=e000000d83348f78
> [<a000000100d93860>] kernel_init+0x20/0x280
> sp=e000000d8334fe30 bsp=e000000d83348f58
> [<a00000010000c1f0>] call_payload+0x50/0x80
> sp=e000000d8334fe30 bsp=e000000d83348f40
> Disabling lock debugging due to kernel taint
> Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
>
> ---[ end Kernel panic - not syncing: Attempted to kill init!
> exitcode=0x0000000b
> ```
>
> ...but because of the result below - spoiler: a v4.19.37 kernel working
> on my rx2800 i2 - I assume they're created by the very same issue.
>
> Starting at tag v4.19.37 I then reverted the following commits:
>
> * cf65a0f6f6ff7631ba0ac0513a14ca5b65320d80 [2]
>
> * 16e73adbca76fd18733278cb688b0ddb4cad162c [3]
>
> * 9d37c094dacda531ac3e529dd4dd139e3c0b7811 [4]
>
> * 4fac8076df854aa4ddb8acbf6cce9d337300219e [5]
>
> * 543cea9accd9804307541cb93d3ed7ec94b07237 [6]
>
> ...and compiled a kernel using the localmodconfig target to create a
> minimal config. The resulting kernel booted fine on my rx2800 i2:
>
> ```
> Linux version 4.19.37-00005-g55bd603c2590-dirty (root@rx2800-i2) (gcc
> version 7.3.0 (Gentoo 7.3.0-r3 p1.4)) #1 SMP Thu Jun 20 23:58:57 CEST 2019
> EFI v2.10 by HP:
> efi: SALsystab=0xdfdd63a18 ACPI 2.0=0x3d3c4014 HCDP=0xdffff8798
> SMBIOS=0x3d368000
> booting generic kernel on platform dig
> PCDP: v3 at 0xdffff8798
> earlycon: uart8250 at I/O port 0x4000 (options '115200n8')
> bootconsole [uart8250] enabled
> ACPI: Early table checksum verification disabled
> ACPI: RSDP 0x000000003D3C4014 000024 (v02 HP )
> ACPI: XSDT 0x000000003D3C4580 000124 (v01 HP RX2800-2 00000001
> 01000013)
> [...]
> * Starting sshd ...
> [ ok ]
> * Starting local ...
> [ ok ]
>
>
> This is rx2800-i2[...] (Linux ia64 4.19.37-00005-g55bd603c2590-dirty)
> 20:49:42
> ```
>
> [2]:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=cf65a0f6f6ff7631ba0ac0513a14ca5b65320d80
>
> [3]:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=16e73adbca76fd18733278cb688b0ddb4cad162c
>
> [4]:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=9d37c094dacda531ac3e529dd4dd139e3c0b7811
>
> [5]:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=4fac8076df854aa4ddb8acbf6cce9d337300219e
>
> [6]:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=543cea9accd9804307541cb93d3ed7ec94b07237
>
> ****
>
> Please note:
>
> * that I'm always using the "ia64: fix ptrace" patch ([7]) in addition,
> as I'm compiling with gcc 7.3.0 on Gentoo;
>
> [7]: https://lore.kernel.org/patchwork/patch/884685/
>
> * that the original problem only shows on my rx2800 i2 and not on my
> other ia64 gear (rx4640 with Madison, rx2620 with Montecito and rx2660
> with Montvale), so could be related to the different system architecture
> of the Tukwila based rx2800 i2 (UMA => NUMA IIC);
>
> I just now tried to compile a more recent v5.2-rc5 kernel with the above
> commits reverted, but that fails. There seem to have been further
> changes made since v4.19.37 for which I would still need to find the
> respective commits to revert. But I assume this work could be unneeded
> for a further examination of the problem, so I don't follow this for
> now. If it is needed please let me know.
>
> James Clarke already had an idea what could be involved in this issue.
> Maybe he can give his assessment.
>
> If you want me to try a patch for a specific Linux version, please let
> me know. The same if you need further information from me.
>
> Cheers
> Frank
---end quoted text---
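
To make the question concrete, here is a small user-space model of a
per-node zonelist lookup. It is only a sketch of one hypothesis: the
second oops above is reported as a NULL pointer dereference at the small
address 0x1688, which would be consistent with (but does not prove) a
field offset being added to a NULL per-node pointer. Everything in it is
an assumption made for illustration: the struct layouts, the 0x1000
padding, MAX_NODES, and the helper names unchecked_zonelist_addr() and
checked_zonelist() are hypothetical and are not the kernel's actual
definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES     4  /* arbitrary for this sketch */
#define MAX_ZONELISTS 2  /* arbitrary for this sketch */

/* Simplified stand-ins; the real kernel structures are far larger. */
struct zonelist {
	int zone_ids[8];
};

struct pglist_data {
	char other_fields[0x1000];  /* arbitrary padding, not the real layout */
	struct zonelist node_zonelists[MAX_ZONELISTS];
};

/* Model of the per-node table: only node 0 has data, the rest are NULL. */
static struct pglist_data node0_data;
static struct pglist_data *node_data[MAX_NODES] = { &node0_data };

/*
 * Address an unchecked lookup would read from. With a NULL per-node
 * pointer this is just the field offset, i.e. a small address.
 */
static uintptr_t unchecked_zonelist_addr(int nid, int idx)
{
	assert(nid >= 0 && nid < MAX_NODES);
	assert(idx >= 0 && idx < MAX_ZONELISTS);
	return (uintptr_t)node_data[nid]
	     + offsetof(struct pglist_data, node_zonelists)
	     + (uintptr_t)idx * sizeof(struct zonelist);
}

/*
 * Checked variant: use fallback_nid when nid is out of range or the
 * node has no data. fallback_nid must name a node that has data.
 */
static struct zonelist *checked_zonelist(int nid, int idx, int fallback_nid)
{
	assert(idx >= 0 && idx < MAX_ZONELISTS);
	if (nid < 0 || nid >= MAX_NODES || node_data[nid] == NULL)
		nid = fallback_nid;
	assert(nid >= 0 && nid < MAX_NODES && node_data[nid] != NULL);
	return &node_data[nid]->node_zonelists[idx];
}
```

In this model a lookup for node 1 (which has no data) yields an address
equal to the field offset, while the checked variant returns node 0's
zonelist instead. Whether the node stored in the pci_dev really has no
per-node data on the rx2800 i2 is exactly what would need to be
confirmed on the machine.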