On Tue, Jun 20, 2023, at 10:00, Naresh Kamboju wrote:
> We have been noticing the following kernel crash on x15 device while running
> LTP fs proc01 testing with Linux stable rc 6.x kernels.

Do you know if this is a regression with this kernel version compared
to older kernels running the same tests, or an added testcase in LTP
that exercises a code path that may have been broken for longer?

> Starting kernel ...
>
> [ 0.000000] Booting Linux on physical CPU 0x0
> [ 0.000000] Linux version 6.3.9-rc1 (tuxmake@tuxmake)
> (arm-linux-gnueabihf-gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld
> (GNU Binutils for Debian) 2.35.2) #1 SMP @1687172533
> [ 0.000000] CPU: ARMv7 Processor [412fc0f2] revision 2 (ARMv7), cr=10c5387d
> [ 0.000000] CPU: div instructions available: patching division code
> [ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache
> [ 0.000000] OF: fdt: Machine model: TI AM5728 BeagleBoard-X15
>
> ..
> LTP fs tests running
>
> cd /opt/ltp
> ./runltp -f fs
>
> atch/ltp-lyYeJYjM8Y/fs_di-4743
> Loops: 10
> Data File Size: 30
> fs_di 0 TINFO : Test Started
> fs_di 0 TINFO : Completed Loop 1
> fs_di 0 TINFO : Completed Loop 2
> fs_di 0 TINFO : Completed Loop 3
> fs_di 0 TINFO : Completed Loop 4
> fs_di 0 TINFO : Completed Loop 5
> fs_di 0 TINFO : Completed Loop 6
> fs_di 0 TINFO : Completed Loop 7
> fs_di 0 TINFO : Completed Loop 8
> fs_di 0 TINFO : Completed Loop 9
> fs_di 0 TINFO : Completed Loop 10
> fs_di 10 TPASS : Test Successful
> [ 1212.864074] 8<--- cut here ---
> [ 1212.867156] Unable to handle kernel NULL pointer dereference at
> virtual address 00000004 when read
> [ 1212.876159] [00000004] *pgd=fb342835
> [ 1212.879760] Internal error: Oops: 17 [#1] SMP ARM
> [ 1212.884490] Modules linked in: etnaviv gpu_sched
> snd_soc_simple_card snd_soc_simple_card_utils onboard_usb_hub
> snd_soc_davinci_mcasp snd_soc_ti_udma snd_soc_ti_edma snd_soc_ti_sdma
> snd_soc_core ac97_bus snd_pcm_dmaengine snd_pcm cfg80211 snd_timer snd
> soundcore
> bluetooth display_connector
> [ 1212.910217] CPU: 0 PID: 4855 Comm: proc01 Not tainted 6.3.9-rc1 #1
> [ 1212.916442] Hardware name: Generic DRA74X (Flattened Device Tree)
> [ 1212.922546] PC is at pci_generic_config_read+0x34/0x8c
> [ 1212.927734] LR is at pci_generic_config_read+0x1c/0x8c

It looks like the PCIe bus is not set up correctly. I also see these
messages in the log indicating a problem with it:

[ 3.334503] dra7-pcie 51000000.pcie: host bridge /ocp/target-module@51000000/pcie@51000000 ranges:
[ 3.343627] dra7-pcie 51000000.pcie: IO 0x0020003000..0x0020012fff -> 0x0000000000
[ 3.351806] dra7-pcie 51000000.pcie: MEM 0x0020013000..0x002fffffff -> 0x0020013000
[ 3.362030] dra7-pcie 51000000.pcie: iATU: unroll F, 16 ob, 4 ib, align 4K, limit 4G
[ 4.370635] dra7-pcie 51000000.pcie: Phy link never came up
[ 4.376831] dra7-pcie 51000000.pcie: PCI host bridge to bus 0000:00
[ 4.383148] pci_bus 0000:00: root bus resource [bus 00-ff]
[ 4.388702] pci_bus 0000:00: root bus resource [io 0x0000-0xffff]
[ 4.394927] pci_bus 0000:00: root bus resource [mem 0x20013000-0x2fffffff]
[ 4.401885] pci 0000:00:00.0: [104c:8888] type 01 class 0x060400
[ 4.407958] pci 0000:00:00.0: reg 0x10: [mem 0x00000000-0x000fffff]
[ 4.414245] pci 0000:00:00.0: reg 0x14: [mem 0x00000000-0x0000ffff]
[ 4.420654] pci 0000:00:00.0: supports D1
[ 4.424682] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
[ 4.437499] PCI: bus0: Fast back to back transfers disabled
[ 4.443389] PCI: bus1: Fast back to back transfers enabled
[ 4.448974] pci 0000:00:00.0: BAR 0: assigned [mem 0x20100000-0x201fffff]
[ 4.455810] pci 0000:00:00.0: BAR 1: assigned [mem 0x20020000-0x2002ffff]
[ 4.462646] pci 0000:00:00.0: PCI bridge to [bus 01-ff]
[ 4.468322] pcieport 0000:00:00.0: PME: Signaling with IRQ 135
[ 4.474487] genirq: Threaded irq requested with handler=NULL and !ONESHOT for dra7xx-pcie-main (irq 132)
[ 4.484100] dra7-pcie 51000000.pcie: failed to request irq
[ 4.489685] dra7-pcie: probe of 51000000.pcie failed
with error -22
[ 4.503967] pcie-clkctrl:0000:0: failed to disable

The function that crashed is

int pci_generic_config_read(struct pci_bus *bus, unsigned int devfn,
			    int where, int size, u32 *val)
{
	void __iomem *addr;

	addr = bus->ops->map_bus(bus, devfn, where);
	if (!addr)
		return PCIBIOS_DEVICE_NOT_FOUND;

	if (size == 1)
		*val = readb(addr);
	else if (size == 2)
		*val = readw(addr);
	else
		*val = readl(addr);

	return PCIBIOS_SUCCESSFUL;
}

I have not disassembled the vmlinux file, but I can see that the offset
into the NULL pointer is '4', which does not match the structure offsets
for bus->ops or ops->map_bus. I also see that if map_bus returns NULL,
we treat that as an error, but if it returns '4', that is taken as a
pointer, which is my best guess at what is happening here.

map_bus() seems to be either dw_pcie_other_conf_map_bus() or
dw_pcie_own_conf_map_bus(), since the dra7 does not have its own variant
but inherits these from the dwc pci driver.

I think this is caused by the combination of two bugs:

- Something prevents the dra7-pcie driver from probing the device
  correctly, ultimately failing with the "failed to request irq" message.

- The error handling in dra7xx_pcie_probe() fails to clean up after the
  first problem, leaving the PCIe host in a broken state instead of
  removing it entirely.

Arnd