Hi!

I'm CC'ing Michael Karcher, who is really good at tracking down such bugs.

Adrian

On 6/26/19 5:58 PM, Christoph Hellwig wrote:
> I'm running out of luck trying to understand the issue with the
> zone list. Adding the ia64 mailing list in addition to Tony
> to see if someone can figure out how a alloc_pages_node for the
> node stored in an AHCI PCIe pci_dev could cause an oops in the
> zonelist lookup.
>
> On Fri, Jun 21, 2019 at 10:08:06PM +0200, Frank Scheiner wrote:
>> Hi there,
>>
>> recent testing of a Debian v4.19.37 kernel showed a problem on my rx2800
>> i2 happening during kernel boot:
>>
>> ```
>> [ 0.000000] Linux version 4.19.0-5-itanium
>> (debian-kernel@xxxxxxxxxxxxxxxx) (gcc version 8.3.0 (Debian
>> 8.3.0-10~ia64.1)) #1 SMP Debian 4.19.37-3 (2019-05-18)
>> [ 0.000000] EFI v2.10 by HP:
>> [ 0.000000] efi: SALsystab=0x6fdd63a18 ACPI 2.0=0x3d3c4014
>> HCDP=0x6ffff8798 SMBIOS=0x3d368000
>> [ 0.000000] booting generic kernel on platform dig
>> [ 0.000000] PCDP: v3 at 0x6ffff8798
>> [ 0.000000] earlycon: uart8250 at I/O port 0x4000 (options '115200n8')
>> [ 0.000000] bootconsole [uart8250] enabled
>> [ 0.000000] ACPI: Early table checksum verification disabled
>> [ 0.000000] ACPI: RSDP 0x000000003D3C4014 000024 (v02 HP )
>> [ 0.000000] ACPI: XSDT 0x000000003D3C4580 000124 (v01 HP RX2800-2
>> 00000001 01000013)
>> [...]
>> [ 13.993718] Unpacking initramfs...
>> [...]
>> [ 22.655630] Run /init as init process
>> [ 22.818930] SCSI subsystem initialized
>> [ 22.844653] ACPI: bus type USB registered
>> [ 22.878940] HP HPSA Driver (v 3.4.20-125)
>> [ 22.930628] usbcore: registered new interface driver usbfs
>> [ 23.072034] usbcore: registered new interface driver hub
>> [ 23.072925] hpsa 0000:01:00.0: Logical aborts not supported
>> [ 23.150942] usbcore: registered new device driver usb
>> [ 23.231690] hpsa 0000:01:00.0: HP SSD Smart Path aborts not supported
>> [ 23.306942] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
>> [ 23.417101] systemd-udevd[115]: NaT consumption 2216203124768 [1]
>> [ 23.488663] ehci-pci: EHCI PCI platform driver
>> [ 23.490942] uhci_hcd: USB Universal Host Controller Interface driver
>> [ 23.420927] Modules linked in: uhci_hcd(+) ehci_pci(+) ehci_hcd
>> hpsa(+) scsi_transport_sas usbcore scsi_mod usb_common
>> [ 23.420927]
>> [ 23.420927] CPU: 6 PID: 115 Comm: systemd-udevd Not tainted
>> 4.19.0-5-itanium #1 Debian 4.19.37-3
>> [ 23.420927] Hardware name: hp Integrity rx2800 i2, BIOS 01.93 09/12/2012
>> [ 23.420927] psr : 0000121008026010 ifs : 8000000000002046 ip :
>> [<a0000001002af041>] Not tainted (4.19.0-5-itanium Debian 4.19.37-3)
>> [ 23.420927] ip is at __alloc_pages_nodemask+0x261/0x20c0
>> [ 23.420927] unat: 0000000000000000 pfs : 0000000000000793 rsc :
>> 0000000000000003
>> [ 23.420927] rnat: 0000000000000000 bsps: 0000000000000000 pr :
>> 85aaa9a99a6a6659
>> [ 23.420927] ldrs: 0000000000000000 ccv : 0000000000000000 fpsr:
>> 0009804c8a70433f
>> [ 23.420927] csd : 0000000000000000 ssd : 0000000000000000
>> [ 23.420927] b0 : a0000001001710e0 b6 : a0000001003948c0 b7 :
>> a0000001000469c0
>> [ 23.420927] f6 : 10012bffff00000000000 f7 : 1003e00000000000bffff
>> [ 23.420927] f8 : 1003e0000000000003fc0 f9 : 1003effffffffffffffab
>> [ 23.420927] f10 : 10016818d087e7cd81a78 f11 : 1003e000000000000002a
>> [ 23.420927] r1 : a0000001015d6ba0 r2 : a0000001013643c8 r3 :
>> fffffffffffc04b8
>> [ 23.420927] r8 : 0000000000001440 r9 : e000000001507708 r10 :
>> 0000000000000008
>> [ 23.420927] r11 : ffffffffffd8d818 r12 : e000000682fcfbd0 r13 :
>> e000000682fc8000
>> [ 23.420927] r14 : a0000001013643b8 r15 : ffffffffffd8d828 r16 :
>> 00000000007fffff
>> [ 23.420927] r17 : 0000000000000008 r18 : 0000000000000000 r19 :
>> e000000001507710
>> [ 23.420927] r20 : 0000000000000000 r21 : 0000000000002500 r22 :
>> 0000000000000000
>> [ 23.420927] r23 : 0000000000000000 r24 : 0000000000000000 r25 :
>> 0000000000000000
>> [ 23.420927] r26 : 0000000000000000 r27 : 0000000000000000 r28 :
>> e000000682fc87b0
>> [ 23.420927] r29 : 0000000000200000 r30 : 0000000000000000 r31 :
>> 0000000000000000
>> [ 23.420927]
>> [ 23.420927] Call Trace:
>> [ 23.420927] [<a000000100014bd0>] show_stack+0x90/0xc0
>> [ 23.420927] sp=e000000682fcf790
>> bsp=e000000682fc9c80
>> [ 23.420927] [<a0000001000152d0>] show_regs+0x6d0/0xa00
>> [ 23.420927] sp=e000000682fcf960
>> bsp=e000000682fc9c10
>> [ 23.420927] [<a000000100029330>] die+0x1b0/0x460
>> [ 23.420927] sp=e000000682fcf980
>> bsp=e000000682fc9bc8
>> [ 23.420927] [<a000000100e75100>] ia64_fault+0x5a0/0xf60
>> [ 23.420927] sp=e000000682fcf980
>> bsp=e000000682fc9b70
>> [ 23.420927] [<a00000010000c9c0>] ia64_leave_kernel+0x0/0x270
>> [ 23.420927] sp=e000000682fcfa00
>> bsp=e000000682fc9b70
>> [ 23.420927] [<a0000001002af040>] __alloc_pages_nodemask+0x260/0x20c0
>> [ 23.420927] sp=e000000682fcfbd0
>> bsp=e000000682fc9938
>> [ 23.420927] [<a0000001001710e0>] dma_direct_alloc+0x140/0x2e0
>> [ 23.420927] sp=e000000682fcfc40
>> bsp=e000000682fc98c0
>> [ 23.420927] [<a000000100173910>] swiotlb_alloc+0x50/0x2e0
>> [ 23.420927] sp=e000000682fcfc40
>> bsp=e000000682fc9868
>> ```
>>
>> The machine doesn't continue boot afterwards. The machine boots fine
>> with a 4.14.x with Gentoo patches but also no later minor kernel version
>> with Gentoo patches works on it. With some testing I could limit the
>> Linux versions, between which the problematic change could have been
>> introduced, to 4.15.x and 4.16.x. Bisecting between tag v4.15.18 (good)
>> and tag v4.16-rc1 (bad) pointed to commit
>> 543cea9accd9804307541cb93d3ed7ec94b07237 ([1]) as first bad commit.
>>
>> [1]:
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=543cea9accd9804307541cb93d3ed7ec94b07237
>>
>> The kernel messages with problematic kernels from the bisecting process
>> look different to the ones from the above shown v4.19.37 from Debian though:
>>
>> ```
>> Linux version 4.15.0-rc7-00047-g543cea9accd9-dirty (root@rx2800-i2) (gcc
>> version 7.3.0 (Gentoo 7.3.0-r3 p1.4)) #1 SMP Thu Jun 13 22:16:30 CEST 2019
>> EFI v2.10 by HP:
>> efi: SALsystab=0xdfdd63a18 ACPI 2.0=0x3d3c4014 HCDP=0xdffff8798
>> SMBIOS=0x3d368000
>> booting generic kernel on platform dig
>> PCDP: v3 at 0xdffff8798
>> earlycon: uart8250 at I/O port 0x4000 (options '115200n8')
>> bootconsole [uart8250] enabled
>> ACPI: Early table checksum verification disabled
>> ACPI: RSDP 0x000000003D3C4014 000024 (v02 HP )
>> ACPI: XSDT 0x000000003D3C4580 000124 (v01 HP RX2800-2 00000001
>> 01000013)
>> [...]
>> Trying to unpack rootfs image as initramfs...
>> [...]
>> Loading Adaptec I2O RAID: Version 2.4 Build 5go
>> Detecting Adaptec I2O RAID controllers...
>> ahci 0000:00:1f.2: AHCI 0001.0200 32 slots 6 ports 3 Gbps 0x3f impl SATA
>> mode
>> ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part ccc ems
>> Unable to handle kernel NULL pointer dereference (address 0000000000001688)
>> swapper/0[1]: Oops 11012296146944 [1]
>> Modules linked in:
>>
>> CPU: 0 PID: 1 Comm: swapper/0 Not tainted
>> 4.15.0-rc7-00047-g543cea9accd9-dirty #1
>> Hardware name: hp Integrity rx2800 i2, BIOS 01.93 09/12/2012
>> psr : 00001210084a6010 ifs : 8000000000001734 ip : [<a000000100180401>]
>> Not tainted (4.15.0-rc7-00047-g543cea9accd9-dirty)
>> ip is at __alloc_pages_nodemask+0x1a1/0x1670
>> unat: 0000000000000000 pfs : 0000000000001734 rsc : 0000000000000003
>> rnat: 000000038c5ad78d bsps: 000000000001003e pr : 565595a66aa65799
>> ldrs: 0000000000000000 ccv : 000000032e40a799 fpsr: 0009804c8a70433f
>> csd : 0000000000000000 ssd : 0000000000000000
>> b0 : a0000001001802c0 b6 : a000000100050b50 b7 : a0000001007e83d0
>> f6 : 1003e0000000000000000 f7 : 1003e00000000000164ff
>> f8 : 1003e0000000000000f00 f9 : 1003e000000000000000f
>> f10 : 1003e0000000000000400 f11 : 1003e0000000000003c00
>> r1 : a00000010155edc0 r2 : a0000001012b5e90 r3 : 0000000001ffffff
>> r8 : 0000000000001680 r9 : 0000000000250015 r10 : e000000001519980
>> r11 : e000000001519988 r12 : e000000d8334fcf0 r13 : e000000d83348000
>> r14 : ffffffffffd570d0 r15 : 0000000000000008 r16 : e000000001519990
>> r17 : 0000000000000000 r18 : 0000000000001680 r19 : 0000000000000000
>> r20 : 0000000000000000 r21 : 0000000000000000 r22 : 0000000000000000
>> r23 : 0000000000000000 r24 : ffffffffffd570c0 r25 : a0000001012b5e80
>> r26 : 0000000000000000 r27 : 0000000000000000 r28 : 0000000000001688
>> r29 : 0000000000000358 r30 : 0000000000000000 r31 : 0000000000000081
>>
>> Call Trace:
>> [<a000000100013760>] show_stack+0x40/0x90
>> sp=e000000d8334f8c0 bsp=e000000d83349890
>> [<a0000001000140e0>] show_regs+0x930/0x940
>> sp=e000000d8334fa90 bsp=e000000d83349820
>> [<a00000010003a7d0>] die+0x1a0/0x2f0
>> sp=e000000d8334fa90 bsp=e000000d833497d8
>> [<a000000100063140>] ia64_do_page_fault+0x830/0xa30
>> sp=e000000d8334fa90 bsp=e000000d83349740
>> [<a00000010000c400>] ia64_leave_kernel+0x0/0x270
>> sp=e000000d8334fb20 bsp=e000000d83349740
>> [<a000000100180400>] __alloc_pages_nodemask+0x1a0/0x1670
>> sp=e000000d8334fcf0 bsp=e000000d83349598
>> [<a000000100d70100>] dma_direct_alloc+0x170/0x470
>> sp=e000000d8334fd50 bsp=e000000d83349518
>> [<a0000001006a8770>] swiotlb_alloc+0x50/0x90
>> sp=e000000d8334fd50 bsp=e000000d833494d8
>> [<a00000010083abd0>] dmam_alloc_coherent+0x250/0x2c0
>> sp=e000000d8334fd50 bsp=e000000d83349488
>> [<a0000001009990c0>] ahci_port_start+0x2f0/0x4b0
>> sp=e000000d8334fd50 bsp=e000000d83349440
>> [<a000000100958490>] ata_host_start+0x310/0x470
>> sp=e000000d8334fd60 bsp=e000000d833493d0
>> [<a000000100964a70>] ata_host_activate+0x20/0x290
>> sp=e000000d8334fd60 bsp=e000000d83349370
>> [<a000000100999570>] ahci_host_activate+0x2f0/0x300
>> sp=e000000d8334fd60 bsp=e000000d83349300
>> [<a0000001009923d0>] ahci_init_one+0x1580/0x20b0
>> sp=e000000d8334fd60 bsp=e000000d83349258
>> [<a0000001006d0610>] local_pci_probe+0x90/0x150
>> sp=e000000d8334fdc0 bsp=e000000d83349218
>> [<a0000001006d1a30>] pci_device_probe+0x2f0/0x310
>> sp=e000000d8334fdc0 bsp=e000000d833491d8
>> [<a0000001008229f0>] driver_probe_device+0x520/0x720
>> sp=e000000d8334fde0 bsp=e000000d83349170
>> [<a000000100822d10>] __driver_attach+0x120/0x190
>> sp=e000000d8334fde0 bsp=e000000d83349140
>> [<a00000010081ec00>] bus_for_each_dev+0x120/0x140
>> sp=e000000d8334fde0 bsp=e000000d83349100
>> [<a000000100821bf0>] driver_attach+0x40/0x60
>> sp=e000000d8334fdf0 bsp=e000000d833490e0
>> [<a0000001008211b0>] bus_add_driver+0x400/0x4a0
>> sp=e000000d8334fdf0 bsp=e000000d83349090
>> [<a000000100823fc0>] driver_register+0x240/0x2d0
>> sp=e000000d8334fdf0 bsp=e000000d83349068
>> [<a0000001006cfde0>] __pci_register_driver+0xa0/0xc0
>> sp=e000000d8334fdf0 bsp=e000000d83349038
>> [<a0000001010ecdb0>] ahci_pci_driver_init+0x50/0x70
>> sp=e000000d8334fdf0 bsp=e000000d83349020
>> [<a00000010000a950>] do_one_initcall+0x290/0x2a0
>> sp=e000000d8334fdf0 bsp=e000000d83348fe0
>> [<a0000001010a1c10>] kernel_init_freeable+0x400/0x430
>> sp=e000000d8334fe30 bsp=e000000d83348f78
>> [<a000000100d93860>] kernel_init+0x20/0x280
>> sp=e000000d8334fe30 bsp=e000000d83348f58
>> [<a00000010000c1f0>] call_payload+0x50/0x80
>> sp=e000000d8334fe30 bsp=e000000d83348f40
>> Disabling lock debugging due to kernel taint
>> Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
>>
>> ---[ end Kernel panic - not syncing: Attempted to kill init!
>> exitcode=0x0000000b
>> ```
>>
>> ...but because of the result below - spoiler: a v4.19.37 kernel working
>> on my rx2800 i2 - I assume they're created by the very same issue.
>>
>> Starting at tag v4.19.37 I then reverted the following commits:
>>
>> * cf65a0f6f6ff7631ba0ac0513a14ca5b65320d80 [2]
>>
>> * 16e73adbca76fd18733278cb688b0ddb4cad162c [3]
>>
>> * 9d37c094dacda531ac3e529dd4dd139e3c0b7811 [4]
>>
>> * 4fac8076df854aa4ddb8acbf6cce9d337300219e [5]
>>
>> * 543cea9accd9804307541cb93d3ed7ec94b07237 [6]
>>
>> ...and compiled a kernel using the localmodconfig target to create a
>> minimal config. The resulting kernel booted fine on my rx2800 i2:
>>
>> ```
>> Linux version 4.19.37-00005-g55bd603c2590-dirty (root@rx2800-i2) (gcc
>> version 7.3.0 (Gentoo 7.3.0-r3 p1.4)) #1 SMP Thu Jun 20 23:58:57 CEST 2019
>> EFI v2.10 by HP:
>> efi: SALsystab=0xdfdd63a18 ACPI 2.0=0x3d3c4014 HCDP=0xdffff8798
>> SMBIOS=0x3d368000
>> booting generic kernel on platform dig
>> PCDP: v3 at 0xdffff8798
>> earlycon: uart8250 at I/O port 0x4000 (options '115200n8')
>> bootconsole [uart8250] enabled
>> ACPI: Early table checksum verification disabled
>> ACPI: RSDP 0x000000003D3C4014 000024 (v02 HP )
>> ACPI: XSDT 0x000000003D3C4580 000124 (v01 HP RX2800-2 00000001
>> 01000013)
>> [...]
>> * Starting sshd ...
>> [ ok ]
>> * Starting local ...
>> [ ok ]
>>
>>
>> This is rx2800-i2[...] (Linux ia64 4.19.37-00005-g55bd603c2590-dirty)
>> 20:49:42
>> ```
>>
>> [2]:
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=cf65a0f6f6ff7631ba0ac0513a14ca5b65320d80
>>
>> [3]:
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=16e73adbca76fd18733278cb688b0ddb4cad162c
>>
>> [4]:
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=9d37c094dacda531ac3e529dd4dd139e3c0b7811
>>
>> [5]:
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=4fac8076df854aa4ddb8acbf6cce9d337300219e
>>
>> [6]:
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=543cea9accd9804307541cb93d3ed7ec94b07237
>>
>> ****
>>
>> Please note:
>>
>> * that I'm always using the "ia64: fix ptrace" patch ([7]) in addition,
>> as I'm compiling with gcc 7.3.0 on Gentoo;
>>
>> [7]: https://lore.kernel.org/patchwork/patch/884685/
>>
>> * that the original problem only shows on my rx2800 i2 and not on my
>> other ia64 gear (rx4640 with Madison, rx2620 with Montecito and rx2660
>> with Montvale), so could be related to the different system architecture
>> of the Tukwila based rx2800 i2 (UMA => NUMA IIC);
>>
>> I just now tried to compile a more recent v5.2-rc5 kernel with the above
>> commits reverted, but that fails. There seem to have been further
>> changes made since v4.19.37 for which I would still need to find the
>> respective commits to revert. But I assume this work could be unneeded
>> for a further examination of the problem, so I don't follow this for
>> now. If it is needed please let me know.
>>
>> James Clarke already had an idea what could be involved in this issue.
>> Maybe he can give his assessment.
>>
>> If you want me to try a patch for a specific Linux version, please let
>> me know. The same if you need further information from me.
>>
>> Cheers
>> Frank
> ---end quoted text---
>

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaubitz@xxxxxxxxxx
`. `'   Freie Universitaet Berlin - glaubitz@xxxxxxxxxxxxxxxxxxx
  `-    GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913