Re: Kernel problem on rx2800 i2

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi!

I'm CC'ing Michael Karcher, who is really good at tracking down
such bugs.

Adrian

On 6/26/19 5:58 PM, Christoph Hellwig wrote:
> I'm running out of luck trying to understand the issue with the
> zone list.  Adding the ia64 mailing list in addition to Tony
> to see if someone can figure out how a alloc_pages_node for the
> node stored in an AHCI PCIe pci_dev could cause an oops in the
> zonelist lookup.
> 
> On Fri, Jun 21, 2019 at 10:08:06PM +0200, Frank Scheiner wrote:
>> Hi there,
>>
>> recent testing of a Debian v4.19.37 kernel showed a problem on my rx2800
>> i2 happening during kernel boot:
>>
>> ```
>> [    0.000000] Linux version 4.19.0-5-itanium
>> (debian-kernel@xxxxxxxxxxxxxxxx) (gcc version 8.3.0 (Debian
>> 8.3.0-10~ia64.1)) #1 SMP Debian 4.19.37-3 (2019-05-18)
>> [    0.000000] EFI v2.10 by HP:
>> [    0.000000] efi:  SALsystab=0x6fdd63a18  ACPI 2.0=0x3d3c4014
>> HCDP=0x6ffff8798  SMBIOS=0x3d368000
>> [    0.000000] booting generic kernel on platform dig
>> [    0.000000] PCDP: v3 at 0x6ffff8798
>> [    0.000000] earlycon: uart8250 at I/O port 0x4000 (options '115200n8')
>> [    0.000000] bootconsole [uart8250] enabled
>> [    0.000000] ACPI: Early table checksum verification disabled
>> [    0.000000] ACPI: RSDP 0x000000003D3C4014 000024 (v02 HP    )
>> [    0.000000] ACPI: XSDT 0x000000003D3C4580 000124 (v01 HP     RX2800-2
>> 00000001      01000013)
>> [...]
>> [   13.993718] Unpacking initramfs...
>> [...]
>> [   22.655630] Run /init as init process
>> [   22.818930] SCSI subsystem initialized
>> [   22.844653] ACPI: bus type USB registered
>> [   22.878940] HP HPSA Driver (v 3.4.20-125)
>> [   22.930628] usbcore: registered new interface driver usbfs
>> [   23.072034] usbcore: registered new interface driver hub
>> [   23.072925] hpsa 0000:01:00.0: Logical aborts not supported
>> [   23.150942] usbcore: registered new device driver usb
>> [   23.231690] hpsa 0000:01:00.0: HP SSD Smart Path aborts not supported
>> [   23.306942] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
>> [   23.417101] systemd-udevd[115]: NaT consumption 2216203124768 [1]
>> [   23.488663] ehci-pci: EHCI PCI platform driver
>> [   23.490942] uhci_hcd: USB Universal Host Controller Interface driver
>> [   23.420927] Modules linked in: uhci_hcd(+) ehci_pci(+) ehci_hcd
>> hpsa(+) scsi_transport_sas usbcore scsi_mod usb_common
>> [   23.420927]
>> [   23.420927] CPU: 6 PID: 115 Comm: systemd-udevd Not tainted
>> 4.19.0-5-itanium #1 Debian 4.19.37-3
>> [   23.420927] Hardware name: hp Integrity rx2800 i2, BIOS 01.93 09/12/2012
>> [   23.420927] psr : 0000121008026010 ifs : 8000000000002046 ip  :
>> [<a0000001002af041>]    Not tainted (4.19.0-5-itanium Debian 4.19.37-3)
>> [   23.420927] ip is at __alloc_pages_nodemask+0x261/0x20c0
>> [   23.420927] unat: 0000000000000000 pfs : 0000000000000793 rsc :
>> 0000000000000003
>> [   23.420927] rnat: 0000000000000000 bsps: 0000000000000000 pr  :
>> 85aaa9a99a6a6659
>> [   23.420927] ldrs: 0000000000000000 ccv : 0000000000000000 fpsr:
>> 0009804c8a70433f
>> [   23.420927] csd : 0000000000000000 ssd : 0000000000000000
>> [   23.420927] b0  : a0000001001710e0 b6  : a0000001003948c0 b7  :
>> a0000001000469c0
>> [   23.420927] f6  : 10012bffff00000000000 f7  : 1003e00000000000bffff
>> [   23.420927] f8  : 1003e0000000000003fc0 f9  : 1003effffffffffffffab
>> [   23.420927] f10 : 10016818d087e7cd81a78 f11 : 1003e000000000000002a
>> [   23.420927] r1  : a0000001015d6ba0 r2  : a0000001013643c8 r3  :
>> fffffffffffc04b8
>> [   23.420927] r8  : 0000000000001440 r9  : e000000001507708 r10 :
>> 0000000000000008
>> [   23.420927] r11 : ffffffffffd8d818 r12 : e000000682fcfbd0 r13 :
>> e000000682fc8000
>> [   23.420927] r14 : a0000001013643b8 r15 : ffffffffffd8d828 r16 :
>> 00000000007fffff
>> [   23.420927] r17 : 0000000000000008 r18 : 0000000000000000 r19 :
>> e000000001507710
>> [   23.420927] r20 : 0000000000000000 r21 : 0000000000002500 r22 :
>> 0000000000000000
>> [   23.420927] r23 : 0000000000000000 r24 : 0000000000000000 r25 :
>> 0000000000000000
>> [   23.420927] r26 : 0000000000000000 r27 : 0000000000000000 r28 :
>> e000000682fc87b0
>> [   23.420927] r29 : 0000000000200000 r30 : 0000000000000000 r31 :
>> 0000000000000000
>> [   23.420927]
>> [   23.420927] Call Trace:
>> [   23.420927]  [<a000000100014bd0>] show_stack+0x90/0xc0
>> [   23.420927]                                 sp=e000000682fcf790
>> bsp=e000000682fc9c80
>> [   23.420927]  [<a0000001000152d0>] show_regs+0x6d0/0xa00
>> [   23.420927]                                 sp=e000000682fcf960
>> bsp=e000000682fc9c10
>> [   23.420927]  [<a000000100029330>] die+0x1b0/0x460
>> [   23.420927]                                 sp=e000000682fcf980
>> bsp=e000000682fc9bc8
>> [   23.420927]  [<a000000100e75100>] ia64_fault+0x5a0/0xf60
>> [   23.420927]                                 sp=e000000682fcf980
>> bsp=e000000682fc9b70
>> [   23.420927]  [<a00000010000c9c0>] ia64_leave_kernel+0x0/0x270
>> [   23.420927]                                 sp=e000000682fcfa00
>> bsp=e000000682fc9b70
>> [   23.420927]  [<a0000001002af040>] __alloc_pages_nodemask+0x260/0x20c0
>> [   23.420927]                                 sp=e000000682fcfbd0
>> bsp=e000000682fc9938
>> [   23.420927]  [<a0000001001710e0>] dma_direct_alloc+0x140/0x2e0
>> [   23.420927]                                 sp=e000000682fcfc40
>> bsp=e000000682fc98c0
>> [   23.420927]  [<a000000100173910>] swiotlb_alloc+0x50/0x2e0
>> [   23.420927]                                 sp=e000000682fcfc40
>> bsp=e000000682fc9868
>> ```
>>
>> The machine doesn't continue boot afterwards. The machine boots fine
>> with a 4.14.x with Gentoo patches but also no later minor kernel version
>> with Gentoo patches works on it. With some testing I could limit the
>> Linux versions, between which the problematic change could have been
>> introduced, to 4.15.x and 4.16.x. Bisecting between tag v4.15.18 (good)
>> and tag v4.16-rc1 (bad) pointed to commit
>> 543cea9accd9804307541cb93d3ed7ec94b07237 ([1]) as first bad commit.
>>
>> [1]:
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=543cea9accd9804307541cb93d3ed7ec94b07237
>>
>> The kernel messages with problematic kernels from the bisecting process
>> look different to the ones from the above shown v4.19.37 from Debian though:
>>
>> ```
>> Linux version 4.15.0-rc7-00047-g543cea9accd9-dirty (root@rx2800-i2) (gcc
>> version 7.3.0 (Gentoo 7.3.0-r3 p1.4)) #1 SMP Thu Jun 13 22:16:30 CEST 2019
>> EFI v2.10 by HP:
>> efi:  SALsystab=0xdfdd63a18  ACPI 2.0=0x3d3c4014  HCDP=0xdffff8798
>> SMBIOS=0x3d368000
>> booting generic kernel on platform dig
>> PCDP: v3 at 0xdffff8798
>> earlycon: uart8250 at I/O port 0x4000 (options '115200n8')
>> bootconsole [uart8250] enabled
>> ACPI: Early table checksum verification disabled
>> ACPI: RSDP 0x000000003D3C4014 000024 (v02 HP    )
>> ACPI: XSDT 0x000000003D3C4580 000124 (v01 HP     RX2800-2 00000001
>> 01000013)
>> [...]
>> Trying to unpack rootfs image as initramfs...
>> [...]
>> Loading Adaptec I2O RAID: Version 2.4 Build 5go
>> Detecting Adaptec I2O RAID controllers...
>> ahci 0000:00:1f.2: AHCI 0001.0200 32 slots 6 ports 3 Gbps 0x3f impl SATA
>> mode
>> ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part ccc ems
>> Unable to handle kernel NULL pointer dereference (address 0000000000001688)
>> swapper/0[1]: Oops 11012296146944 [1]
>> Modules linked in:
>>
>> CPU: 0 PID: 1 Comm: swapper/0 Not tainted
>> 4.15.0-rc7-00047-g543cea9accd9-dirty #1
>> Hardware name: hp Integrity rx2800 i2, BIOS 01.93 09/12/2012
>> psr : 00001210084a6010 ifs : 8000000000001734 ip  : [<a000000100180401>]
>>    Not tainted (4.15.0-rc7-00047-g543cea9accd9-dirty)
>> ip is at __alloc_pages_nodemask+0x1a1/0x1670
>> unat: 0000000000000000 pfs : 0000000000001734 rsc : 0000000000000003
>> rnat: 000000038c5ad78d bsps: 000000000001003e pr  : 565595a66aa65799
>> ldrs: 0000000000000000 ccv : 000000032e40a799 fpsr: 0009804c8a70433f
>> csd : 0000000000000000 ssd : 0000000000000000
>> b0  : a0000001001802c0 b6  : a000000100050b50 b7  : a0000001007e83d0
>> f6  : 1003e0000000000000000 f7  : 1003e00000000000164ff
>> f8  : 1003e0000000000000f00 f9  : 1003e000000000000000f
>> f10 : 1003e0000000000000400 f11 : 1003e0000000000003c00
>> r1  : a00000010155edc0 r2  : a0000001012b5e90 r3  : 0000000001ffffff
>> r8  : 0000000000001680 r9  : 0000000000250015 r10 : e000000001519980
>> r11 : e000000001519988 r12 : e000000d8334fcf0 r13 : e000000d83348000
>> r14 : ffffffffffd570d0 r15 : 0000000000000008 r16 : e000000001519990
>> r17 : 0000000000000000 r18 : 0000000000001680 r19 : 0000000000000000
>> r20 : 0000000000000000 r21 : 0000000000000000 r22 : 0000000000000000
>> r23 : 0000000000000000 r24 : ffffffffffd570c0 r25 : a0000001012b5e80
>> r26 : 0000000000000000 r27 : 0000000000000000 r28 : 0000000000001688
>> r29 : 0000000000000358 r30 : 0000000000000000 r31 : 0000000000000081
>>
>> Call Trace:
>>  [<a000000100013760>] show_stack+0x40/0x90
>>                                 sp=e000000d8334f8c0 bsp=e000000d83349890
>>  [<a0000001000140e0>] show_regs+0x930/0x940
>>                                 sp=e000000d8334fa90 bsp=e000000d83349820
>>  [<a00000010003a7d0>] die+0x1a0/0x2f0
>>                                 sp=e000000d8334fa90 bsp=e000000d833497d8
>>  [<a000000100063140>] ia64_do_page_fault+0x830/0xa30
>>                                 sp=e000000d8334fa90 bsp=e000000d83349740
>>  [<a00000010000c400>] ia64_leave_kernel+0x0/0x270
>>                                 sp=e000000d8334fb20 bsp=e000000d83349740
>>  [<a000000100180400>] __alloc_pages_nodemask+0x1a0/0x1670
>>                                 sp=e000000d8334fcf0 bsp=e000000d83349598
>>  [<a000000100d70100>] dma_direct_alloc+0x170/0x470
>>                                 sp=e000000d8334fd50 bsp=e000000d83349518
>>  [<a0000001006a8770>] swiotlb_alloc+0x50/0x90
>>                                 sp=e000000d8334fd50 bsp=e000000d833494d8
>>  [<a00000010083abd0>] dmam_alloc_coherent+0x250/0x2c0
>>                                 sp=e000000d8334fd50 bsp=e000000d83349488
>>  [<a0000001009990c0>] ahci_port_start+0x2f0/0x4b0
>>                                 sp=e000000d8334fd50 bsp=e000000d83349440
>>  [<a000000100958490>] ata_host_start+0x310/0x470
>>                                 sp=e000000d8334fd60 bsp=e000000d833493d0
>>  [<a000000100964a70>] ata_host_activate+0x20/0x290
>>                                 sp=e000000d8334fd60 bsp=e000000d83349370
>>  [<a000000100999570>] ahci_host_activate+0x2f0/0x300
>>                                 sp=e000000d8334fd60 bsp=e000000d83349300
>>  [<a0000001009923d0>] ahci_init_one+0x1580/0x20b0
>>                                 sp=e000000d8334fd60 bsp=e000000d83349258
>>  [<a0000001006d0610>] local_pci_probe+0x90/0x150
>>                                 sp=e000000d8334fdc0 bsp=e000000d83349218
>>  [<a0000001006d1a30>] pci_device_probe+0x2f0/0x310
>>                                 sp=e000000d8334fdc0 bsp=e000000d833491d8
>>  [<a0000001008229f0>] driver_probe_device+0x520/0x720
>>                                 sp=e000000d8334fde0 bsp=e000000d83349170
>>  [<a000000100822d10>] __driver_attach+0x120/0x190
>>                                 sp=e000000d8334fde0 bsp=e000000d83349140
>>  [<a00000010081ec00>] bus_for_each_dev+0x120/0x140
>>                                 sp=e000000d8334fde0 bsp=e000000d83349100
>>  [<a000000100821bf0>] driver_attach+0x40/0x60
>>                                 sp=e000000d8334fdf0 bsp=e000000d833490e0
>>  [<a0000001008211b0>] bus_add_driver+0x400/0x4a0
>>                                 sp=e000000d8334fdf0 bsp=e000000d83349090
>>  [<a000000100823fc0>] driver_register+0x240/0x2d0
>>                                 sp=e000000d8334fdf0 bsp=e000000d83349068
>>  [<a0000001006cfde0>] __pci_register_driver+0xa0/0xc0
>>                                 sp=e000000d8334fdf0 bsp=e000000d83349038
>>  [<a0000001010ecdb0>] ahci_pci_driver_init+0x50/0x70
>>                                 sp=e000000d8334fdf0 bsp=e000000d83349020
>>  [<a00000010000a950>] do_one_initcall+0x290/0x2a0
>>                                 sp=e000000d8334fdf0 bsp=e000000d83348fe0
>>  [<a0000001010a1c10>] kernel_init_freeable+0x400/0x430
>>                                 sp=e000000d8334fe30 bsp=e000000d83348f78
>>  [<a000000100d93860>] kernel_init+0x20/0x280
>>                                 sp=e000000d8334fe30 bsp=e000000d83348f58
>>  [<a00000010000c1f0>] call_payload+0x50/0x80
>>                                 sp=e000000d8334fe30 bsp=e000000d83348f40
>> Disabling lock debugging due to kernel taint
>> Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
>>
>> ---[ end Kernel panic - not syncing: Attempted to kill init!
>> exitcode=0x0000000b
>> ```
>>
>> ...but because of the result below - spoiler: a v4.19.37 kernel working
>> on my rx2800 i2 - I assume they're created by the very same issue.
>>
>> Starting at tag v4.19.37 I then reverted the following commits:
>>
>> * cf65a0f6f6ff7631ba0ac0513a14ca5b65320d80 [2]
>>
>> * 16e73adbca76fd18733278cb688b0ddb4cad162c [3]
>>
>> * 9d37c094dacda531ac3e529dd4dd139e3c0b7811 [4]
>>
>> * 4fac8076df854aa4ddb8acbf6cce9d337300219e [5]
>>
>> * 543cea9accd9804307541cb93d3ed7ec94b07237 [6]
>>
>> ...and compiled a kernel using the localmodconfig target to create a
>> minimal config. The resulting kernel booted fine on my rx2800 i2:
>>
>> ```
>> Linux version 4.19.37-00005-g55bd603c2590-dirty (root@rx2800-i2) (gcc
>> version 7.3.0 (Gentoo 7.3.0-r3 p1.4)) #1 SMP Thu Jun 20 23:58:57 CEST 2019
>> EFI v2.10 by HP:
>> efi:  SALsystab=0xdfdd63a18  ACPI 2.0=0x3d3c4014  HCDP=0xdffff8798
>> SMBIOS=0x3d368000
>> booting generic kernel on platform dig
>> PCDP: v3 at 0xdffff8798
>> earlycon: uart8250 at I/O port 0x4000 (options '115200n8')
>> bootconsole [uart8250] enabled
>> ACPI: Early table checksum verification disabled
>> ACPI: RSDP 0x000000003D3C4014 000024 (v02 HP    )
>> ACPI: XSDT 0x000000003D3C4580 000124 (v01 HP     RX2800-2 00000001
>> 01000013)
>> [...]
>>  * Starting sshd ...
>>  [ ok ]
>>  * Starting local ...
>>  [ ok ]
>>
>>
>> This is rx2800-i2[...] (Linux ia64 4.19.37-00005-g55bd603c2590-dirty)
>> 20:49:42
>> ```
>>
>> [2]:
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=cf65a0f6f6ff7631ba0ac0513a14ca5b65320d80
>>
>> [3]:
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=16e73adbca76fd18733278cb688b0ddb4cad162c
>>
>> [4]:
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=9d37c094dacda531ac3e529dd4dd139e3c0b7811
>>
>> [5]:
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=4fac8076df854aa4ddb8acbf6cce9d337300219e
>>
>> [6]:
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=543cea9accd9804307541cb93d3ed7ec94b07237
>>
>> ****
>>
>> Please note:
>>
>> * that I'm always using the "ia64: fix ptrace" patch ([7]) in addition,
>> as I'm compiling with gcc 7.3.0 on Gentoo;
>>
>> [7]: https://lore.kernel.org/patchwork/patch/884685/
>>
>> * that the original problem only shows on my rx2800 i2 and not on my
>> other ia64 gear (rx4640 with Madison, rx2620 with Montecito and rx2660
>> with Montvale), so could be related to the different system architecture
>> of the Tukwila based rx2800 i2 (UMA => NUMA IIC);
>>
>> I just now tried to compile a more recent v5.2-rc5 kernel with the above
>> commits reverted, but that fails. There seem to have been further
>> changes made since v4.19.37 for which I would still need to find the
>> respective commits to revert. But I assume this work could be unneeded
>> for a further examination of the problem, so I don't follow this for
>> now. If it is needed please let me know.
>>
>> James Clarke already had an idea what could be involved in this issue.
>> Maybe he can give his assessment.
>>
>> If you want me to try a patch for a specific Linux version, please let
>> me know. The same if you need further information from me.
>>
>> Cheers
>> Frank
> ---end quoted text---
> 

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaubitz@xxxxxxxxxx
`. `'   Freie Universitaet Berlin - glaubitz@xxxxxxxxxxxxxxxxxxx
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



[Index of Archives]     [Linux Kernel]     [Sparc Linux]     [DCCP]     [Linux ARM]     [Yosemite News]     [Linux SCSI]     [Linux x86_64]     [Linux for Ham Radio]

  Powered by Linux