Kernel problem on rx2800 i2

Hi there,

Recent testing of a Debian v4.19.37 kernel revealed a problem on my
rx2800 i2 during kernel boot:

```
[    0.000000] Linux version 4.19.0-5-itanium
(debian-kernel@xxxxxxxxxxxxxxxx) (gcc version 8.3.0 (Debian
8.3.0-10~ia64.1)) #1 SMP Debian 4.19.37-3 (2019-05-18)
[    0.000000] EFI v2.10 by HP:
[    0.000000] efi:  SALsystab=0x6fdd63a18  ACPI 2.0=0x3d3c4014
HCDP=0x6ffff8798  SMBIOS=0x3d368000
[    0.000000] booting generic kernel on platform dig
[    0.000000] PCDP: v3 at 0x6ffff8798
[    0.000000] earlycon: uart8250 at I/O port 0x4000 (options '115200n8')
[    0.000000] bootconsole [uart8250] enabled
[    0.000000] ACPI: Early table checksum verification disabled
[    0.000000] ACPI: RSDP 0x000000003D3C4014 000024 (v02 HP    )
[    0.000000] ACPI: XSDT 0x000000003D3C4580 000124 (v01 HP     RX2800-2
00000001      01000013)
[...]
[   13.993718] Unpacking initramfs...
[...]
[   22.655630] Run /init as init process
[   22.818930] SCSI subsystem initialized
[   22.844653] ACPI: bus type USB registered
[   22.878940] HP HPSA Driver (v 3.4.20-125)
[   22.930628] usbcore: registered new interface driver usbfs
[   23.072034] usbcore: registered new interface driver hub
[   23.072925] hpsa 0000:01:00.0: Logical aborts not supported
[   23.150942] usbcore: registered new device driver usb
[   23.231690] hpsa 0000:01:00.0: HP SSD Smart Path aborts not supported
[   23.306942] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[   23.417101] systemd-udevd[115]: NaT consumption 2216203124768 [1]
[   23.488663] ehci-pci: EHCI PCI platform driver
[   23.490942] uhci_hcd: USB Universal Host Controller Interface driver
[   23.420927] Modules linked in: uhci_hcd(+) ehci_pci(+) ehci_hcd
hpsa(+) scsi_transport_sas usbcore scsi_mod usb_common
[   23.420927]
[   23.420927] CPU: 6 PID: 115 Comm: systemd-udevd Not tainted
4.19.0-5-itanium #1 Debian 4.19.37-3
[   23.420927] Hardware name: hp Integrity rx2800 i2, BIOS 01.93 09/12/2012
[   23.420927] psr : 0000121008026010 ifs : 8000000000002046 ip  :
[<a0000001002af041>]    Not tainted (4.19.0-5-itanium Debian 4.19.37-3)
[   23.420927] ip is at __alloc_pages_nodemask+0x261/0x20c0
[   23.420927] unat: 0000000000000000 pfs : 0000000000000793 rsc :
0000000000000003
[   23.420927] rnat: 0000000000000000 bsps: 0000000000000000 pr  :
85aaa9a99a6a6659
[   23.420927] ldrs: 0000000000000000 ccv : 0000000000000000 fpsr:
0009804c8a70433f
[   23.420927] csd : 0000000000000000 ssd : 0000000000000000
[   23.420927] b0  : a0000001001710e0 b6  : a0000001003948c0 b7  :
a0000001000469c0
[   23.420927] f6  : 10012bffff00000000000 f7  : 1003e00000000000bffff
[   23.420927] f8  : 1003e0000000000003fc0 f9  : 1003effffffffffffffab
[   23.420927] f10 : 10016818d087e7cd81a78 f11 : 1003e000000000000002a
[   23.420927] r1  : a0000001015d6ba0 r2  : a0000001013643c8 r3  :
fffffffffffc04b8
[   23.420927] r8  : 0000000000001440 r9  : e000000001507708 r10 :
0000000000000008
[   23.420927] r11 : ffffffffffd8d818 r12 : e000000682fcfbd0 r13 :
e000000682fc8000
[   23.420927] r14 : a0000001013643b8 r15 : ffffffffffd8d828 r16 :
00000000007fffff
[   23.420927] r17 : 0000000000000008 r18 : 0000000000000000 r19 :
e000000001507710
[   23.420927] r20 : 0000000000000000 r21 : 0000000000002500 r22 :
0000000000000000
[   23.420927] r23 : 0000000000000000 r24 : 0000000000000000 r25 :
0000000000000000
[   23.420927] r26 : 0000000000000000 r27 : 0000000000000000 r28 :
e000000682fc87b0
[   23.420927] r29 : 0000000000200000 r30 : 0000000000000000 r31 :
0000000000000000
[   23.420927]
[   23.420927] Call Trace:
[   23.420927]  [<a000000100014bd0>] show_stack+0x90/0xc0
[   23.420927]                                 sp=e000000682fcf790
bsp=e000000682fc9c80
[   23.420927]  [<a0000001000152d0>] show_regs+0x6d0/0xa00
[   23.420927]                                 sp=e000000682fcf960
bsp=e000000682fc9c10
[   23.420927]  [<a000000100029330>] die+0x1b0/0x460
[   23.420927]                                 sp=e000000682fcf980
bsp=e000000682fc9bc8
[   23.420927]  [<a000000100e75100>] ia64_fault+0x5a0/0xf60
[   23.420927]                                 sp=e000000682fcf980
bsp=e000000682fc9b70
[   23.420927]  [<a00000010000c9c0>] ia64_leave_kernel+0x0/0x270
[   23.420927]                                 sp=e000000682fcfa00
bsp=e000000682fc9b70
[   23.420927]  [<a0000001002af040>] __alloc_pages_nodemask+0x260/0x20c0
[   23.420927]                                 sp=e000000682fcfbd0
bsp=e000000682fc9938
[   23.420927]  [<a0000001001710e0>] dma_direct_alloc+0x140/0x2e0
[   23.420927]                                 sp=e000000682fcfc40
bsp=e000000682fc98c0
[   23.420927]  [<a000000100173910>] swiotlb_alloc+0x50/0x2e0
[   23.420927]                                 sp=e000000682fcfc40
bsp=e000000682fc9868
```

The machine doesn't continue booting afterwards. It boots fine with a
4.14.x kernel with Gentoo patches, but no later minor kernel version
with Gentoo patches works on it. With some testing I could narrow down
the Linux versions between which the problematic change was introduced
to 4.15.x and 4.16.x. Bisecting between tag v4.15.18 (good) and tag
v4.16-rc1 (bad) pointed to commit
543cea9accd9804307541cb93d3ed7ec94b07237 ([1]) as the first bad commit.

[1]:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=543cea9accd9804307541cb93d3ed7ec94b07237
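
For anyone who wants to repeat the bisection, the session over that
range could be opened roughly as sketched below. This is only an
illustration, not a record of the exact commands used: the helper names
`bisect_start_commands` and `run` are made up here, and the script only
prints the commands unless `dry_run` is switched off inside a kernel
git tree. Each step then still needs a build, a boot test on the
machine, and a `git bisect good` or `git bisect bad` verdict.

```python
import subprocess

GOOD_TAG = "v4.15.18"   # tag reported above as good
BAD_TAG = "v4.16-rc1"   # tag reported above as bad


def bisect_start_commands(good, bad):
    """Return the git invocations that open a bisect session."""
    return [
        ["git", "bisect", "start"],
        ["git", "bisect", "bad", bad],
        ["git", "bisect", "good", good],
    ]


def run(commands, repo=".", dry_run=True):
    """Print each command; execute it in `repo` only if dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, cwd=repo, check=True)


if __name__ == "__main__":
    # Dry run by default: nothing is executed, the commands are only shown.
    run(bisect_start_commands(GOOD_TAG, BAD_TAG))
```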

The kernel messages from the problematic kernels built during the
bisection look different from those of the Debian v4.19.37 kernel shown
above, though:

```
Linux version 4.15.0-rc7-00047-g543cea9accd9-dirty (root@rx2800-i2) (gcc
version 7.3.0 (Gentoo 7.3.0-r3 p1.4)) #1 SMP Thu Jun 13 22:16:30 CEST 2019
EFI v2.10 by HP:
efi:  SALsystab=0xdfdd63a18  ACPI 2.0=0x3d3c4014  HCDP=0xdffff8798
SMBIOS=0x3d368000
booting generic kernel on platform dig
PCDP: v3 at 0xdffff8798
earlycon: uart8250 at I/O port 0x4000 (options '115200n8')
bootconsole [uart8250] enabled
ACPI: Early table checksum verification disabled
ACPI: RSDP 0x000000003D3C4014 000024 (v02 HP    )
ACPI: XSDT 0x000000003D3C4580 000124 (v01 HP     RX2800-2 00000001
01000013)
[...]
Trying to unpack rootfs image as initramfs...
[...]
Loading Adaptec I2O RAID: Version 2.4 Build 5go
Detecting Adaptec I2O RAID controllers...
ahci 0000:00:1f.2: AHCI 0001.0200 32 slots 6 ports 3 Gbps 0x3f impl SATA
mode
ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part ccc ems
Unable to handle kernel NULL pointer dereference (address 0000000000001688)
swapper/0[1]: Oops 11012296146944 [1]
Modules linked in:

CPU: 0 PID: 1 Comm: swapper/0 Not tainted
4.15.0-rc7-00047-g543cea9accd9-dirty #1
Hardware name: hp Integrity rx2800 i2, BIOS 01.93 09/12/2012
psr : 00001210084a6010 ifs : 8000000000001734 ip  : [<a000000100180401>]
   Not tainted (4.15.0-rc7-00047-g543cea9accd9-dirty)
ip is at __alloc_pages_nodemask+0x1a1/0x1670
unat: 0000000000000000 pfs : 0000000000001734 rsc : 0000000000000003
rnat: 000000038c5ad78d bsps: 000000000001003e pr  : 565595a66aa65799
ldrs: 0000000000000000 ccv : 000000032e40a799 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000001001802c0 b6  : a000000100050b50 b7  : a0000001007e83d0
f6  : 1003e0000000000000000 f7  : 1003e00000000000164ff
f8  : 1003e0000000000000f00 f9  : 1003e000000000000000f
f10 : 1003e0000000000000400 f11 : 1003e0000000000003c00
r1  : a00000010155edc0 r2  : a0000001012b5e90 r3  : 0000000001ffffff
r8  : 0000000000001680 r9  : 0000000000250015 r10 : e000000001519980
r11 : e000000001519988 r12 : e000000d8334fcf0 r13 : e000000d83348000
r14 : ffffffffffd570d0 r15 : 0000000000000008 r16 : e000000001519990
r17 : 0000000000000000 r18 : 0000000000001680 r19 : 0000000000000000
r20 : 0000000000000000 r21 : 0000000000000000 r22 : 0000000000000000
r23 : 0000000000000000 r24 : ffffffffffd570c0 r25 : a0000001012b5e80
r26 : 0000000000000000 r27 : 0000000000000000 r28 : 0000000000001688
r29 : 0000000000000358 r30 : 0000000000000000 r31 : 0000000000000081

Call Trace:
 [<a000000100013760>] show_stack+0x40/0x90
                                sp=e000000d8334f8c0 bsp=e000000d83349890
 [<a0000001000140e0>] show_regs+0x930/0x940
                                sp=e000000d8334fa90 bsp=e000000d83349820
 [<a00000010003a7d0>] die+0x1a0/0x2f0
                                sp=e000000d8334fa90 bsp=e000000d833497d8
 [<a000000100063140>] ia64_do_page_fault+0x830/0xa30
                                sp=e000000d8334fa90 bsp=e000000d83349740
 [<a00000010000c400>] ia64_leave_kernel+0x0/0x270
                                sp=e000000d8334fb20 bsp=e000000d83349740
 [<a000000100180400>] __alloc_pages_nodemask+0x1a0/0x1670
                                sp=e000000d8334fcf0 bsp=e000000d83349598
 [<a000000100d70100>] dma_direct_alloc+0x170/0x470
                                sp=e000000d8334fd50 bsp=e000000d83349518
 [<a0000001006a8770>] swiotlb_alloc+0x50/0x90
                                sp=e000000d8334fd50 bsp=e000000d833494d8
 [<a00000010083abd0>] dmam_alloc_coherent+0x250/0x2c0
                                sp=e000000d8334fd50 bsp=e000000d83349488
 [<a0000001009990c0>] ahci_port_start+0x2f0/0x4b0
                                sp=e000000d8334fd50 bsp=e000000d83349440
 [<a000000100958490>] ata_host_start+0x310/0x470
                                sp=e000000d8334fd60 bsp=e000000d833493d0
 [<a000000100964a70>] ata_host_activate+0x20/0x290
                                sp=e000000d8334fd60 bsp=e000000d83349370
 [<a000000100999570>] ahci_host_activate+0x2f0/0x300
                                sp=e000000d8334fd60 bsp=e000000d83349300
 [<a0000001009923d0>] ahci_init_one+0x1580/0x20b0
                                sp=e000000d8334fd60 bsp=e000000d83349258
 [<a0000001006d0610>] local_pci_probe+0x90/0x150
                                sp=e000000d8334fdc0 bsp=e000000d83349218
 [<a0000001006d1a30>] pci_device_probe+0x2f0/0x310
                                sp=e000000d8334fdc0 bsp=e000000d833491d8
 [<a0000001008229f0>] driver_probe_device+0x520/0x720
                                sp=e000000d8334fde0 bsp=e000000d83349170
 [<a000000100822d10>] __driver_attach+0x120/0x190
                                sp=e000000d8334fde0 bsp=e000000d83349140
 [<a00000010081ec00>] bus_for_each_dev+0x120/0x140
                                sp=e000000d8334fde0 bsp=e000000d83349100
 [<a000000100821bf0>] driver_attach+0x40/0x60
                                sp=e000000d8334fdf0 bsp=e000000d833490e0
 [<a0000001008211b0>] bus_add_driver+0x400/0x4a0
                                sp=e000000d8334fdf0 bsp=e000000d83349090
 [<a000000100823fc0>] driver_register+0x240/0x2d0
                                sp=e000000d8334fdf0 bsp=e000000d83349068
 [<a0000001006cfde0>] __pci_register_driver+0xa0/0xc0
                                sp=e000000d8334fdf0 bsp=e000000d83349038
 [<a0000001010ecdb0>] ahci_pci_driver_init+0x50/0x70
                                sp=e000000d8334fdf0 bsp=e000000d83349020
 [<a00000010000a950>] do_one_initcall+0x290/0x2a0
                                sp=e000000d8334fdf0 bsp=e000000d83348fe0
 [<a0000001010a1c10>] kernel_init_freeable+0x400/0x430
                                sp=e000000d8334fe30 bsp=e000000d83348f78
 [<a000000100d93860>] kernel_init+0x20/0x280
                                sp=e000000d8334fe30 bsp=e000000d83348f58
 [<a00000010000c1f0>] call_payload+0x50/0x80
                                sp=e000000d8334fe30 bsp=e000000d83348f40
Disabling lock debugging due to kernel taint
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

---[ end Kernel panic - not syncing: Attempted to kill init!
exitcode=0x0000000b
```

...but because of the result below (spoiler: a v4.19.37 kernel that
works on my rx2800 i2), I assume both are caused by the very same issue.
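
The overlap between the two call traces can also be checked
mechanically. The following is a minimal sketch with excerpts of both
traces pasted in by hand; the names `frames`, `shared_prefix` and the
`REPORTING` set are illustrative choices, not part of any kernel
tooling. It drops the frames that merely report the fault and compares
what is left, starting at the faulting function.

```python
import re

# Matches the symbol in call-trace lines such as
#   [<a0000001002af040>] __alloc_pages_nodemask+0x260/0x20c0
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

# Frames that belong to the fault-reporting path, not to the failing code.
REPORTING = {"show_stack", "show_regs", "die", "ia64_fault",
             "ia64_do_page_fault", "ia64_leave_kernel"}


def frames(log_text):
    """Return call-trace symbols in order, without the reporting frames."""
    return [s for s in FRAME_RE.findall(log_text) if s not in REPORTING]


def shared_prefix(a, b):
    """Return the leading frames that two traces have in common."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out


debian_4_19 = """
[   23.420927]  [<a000000100e75100>] ia64_fault+0x5a0/0xf60
[   23.420927]  [<a00000010000c9c0>] ia64_leave_kernel+0x0/0x270
[   23.420927]  [<a0000001002af040>] __alloc_pages_nodemask+0x260/0x20c0
[   23.420927]  [<a0000001001710e0>] dma_direct_alloc+0x140/0x2e0
[   23.420927]  [<a000000100173910>] swiotlb_alloc+0x50/0x2e0
"""

bisect_4_15 = """
 [<a000000100063140>] ia64_do_page_fault+0x830/0xa30
 [<a00000010000c400>] ia64_leave_kernel+0x0/0x270
 [<a000000100180400>] __alloc_pages_nodemask+0x1a0/0x1670
 [<a000000100d70100>] dma_direct_alloc+0x170/0x470
 [<a0000001006a8770>] swiotlb_alloc+0x50/0x90
 [<a00000010083abd0>] dmam_alloc_coherent+0x250/0x2c0
"""

common = shared_prefix(frames(debian_4_19), frames(bisect_4_15))
print(common)
# prints ['__alloc_pages_nodemask', 'dma_direct_alloc', 'swiotlb_alloc']
```

Both traces fault in `__alloc_pages_nodemask` when called via
`swiotlb_alloc` and `dma_direct_alloc`, even though the reported fault
type differs (NaT consumption vs. NULL pointer dereference).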

Starting at tag v4.19.37 I then reverted the following commits:

* cf65a0f6f6ff7631ba0ac0513a14ca5b65320d80 [2]

* 16e73adbca76fd18733278cb688b0ddb4cad162c [3]

* 9d37c094dacda531ac3e529dd4dd139e3c0b7811 [4]

* 4fac8076df854aa4ddb8acbf6cce9d337300219e [5]

* 543cea9accd9804307541cb93d3ed7ec94b07237 [6]

...and compiled a kernel using the localmodconfig target to create a
minimal config. The resulting kernel booted fine on my rx2800 i2:

```
Linux version 4.19.37-00005-g55bd603c2590-dirty (root@rx2800-i2) (gcc
version 7.3.0 (Gentoo 7.3.0-r3 p1.4)) #1 SMP Thu Jun 20 23:58:57 CEST 2019
EFI v2.10 by HP:
efi:  SALsystab=0xdfdd63a18  ACPI 2.0=0x3d3c4014  HCDP=0xdffff8798
SMBIOS=0x3d368000
booting generic kernel on platform dig
PCDP: v3 at 0xdffff8798
earlycon: uart8250 at I/O port 0x4000 (options '115200n8')
bootconsole [uart8250] enabled
ACPI: Early table checksum verification disabled
ACPI: RSDP 0x000000003D3C4014 000024 (v02 HP    )
ACPI: XSDT 0x000000003D3C4580 000124 (v01 HP     RX2800-2 00000001
01000013)
[...]
 * Starting sshd ...
 [ ok ]
 * Starting local ...
 [ ok ]


This is rx2800-i2[...] (Linux ia64 4.19.37-00005-g55bd603c2590-dirty)
20:49:42
```
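
The revert-and-build steps could be reproduced roughly as sketched
below. This assumes the five commits are reverted in the order they are
listed above and that each revert applies cleanly, neither of which is
spelled out in this message; the helper name `revert_commands` is
hypothetical. The script only prints the commands and does not run
them.

```python
BASE_TAG = "v4.19.37"

# Commits named above, in the order they are listed (assumed revert order).
REVERTS = [
    "cf65a0f6f6ff7631ba0ac0513a14ca5b65320d80",
    "16e73adbca76fd18733278cb688b0ddb4cad162c",
    "9d37c094dacda531ac3e529dd4dd139e3c0b7811",
    "4fac8076df854aa4ddb8acbf6cce9d337300219e",
    "543cea9accd9804307541cb93d3ed7ec94b07237",
]


def revert_commands(base, commits):
    """Return the commands to check out `base`, revert `commits`, and
    generate a minimal config from the currently loaded modules."""
    cmds = [["git", "checkout", "--detach", base]]
    cmds += [["git", "revert", "--no-edit", c] for c in commits]
    cmds.append(["make", "localmodconfig"])
    return cmds


if __name__ == "__main__":
    for argv in revert_commands(BASE_TAG, REVERTS):
        print(" ".join(argv))
```

Five reverts on top of the tag would match the `-00005-` in the version
string of the working kernel shown above.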

[2]:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=cf65a0f6f6ff7631ba0ac0513a14ca5b65320d80

[3]:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=16e73adbca76fd18733278cb688b0ddb4cad162c

[4]:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=9d37c094dacda531ac3e529dd4dd139e3c0b7811

[5]:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=4fac8076df854aa4ddb8acbf6cce9d337300219e

[6]:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=543cea9accd9804307541cb93d3ed7ec94b07237

****

Please note:

* that I'm always using the "ia64: fix ptrace" patch ([7]) in addition,
as I'm compiling with gcc 7.3.0 on Gentoo;

[7]: https://lore.kernel.org/patchwork/patch/884685/

* that the original problem only shows on my rx2800 i2 and not on my
other ia64 gear (rx4640 with Madison, rx2620 with Montecito and rx2660
with Montvale), so it could be related to the different system
architecture of the Tukwila-based rx2800 i2 (UMA => NUMA IIC).

I just now tried to compile a more recent v5.2-rc5 kernel with the above
commits reverted, but that fails. There seem to have been further
changes since v4.19.37, for which I would still need to find the
respective commits to revert. But I assume this work is not needed for
further examination of the problem, so I'm not pursuing it for now. If
it is needed, please let me know.

James Clarke already had an idea of what could be involved in this
issue. Maybe he can give his assessment.

If you want me to try a patch for a specific Linux version, please let
me know. The same goes if you need further information from me.

Cheers
Frank



