On Tue, 20 Jun 2023 at 14:10, Arnd Bergmann <arnd@xxxxxxxx> wrote:
>
> On Tue, Jun 20, 2023, at 10:00, Naresh Kamboju wrote:
> > We have been noticing the following kernel crash on x15 device while running
> > LTP fs proc01 testing with Linux stable rc 6.x kernels.
>
> Do you know if this is a regression with this kernel version compared
> to older kernels running the same tests, or an added testcase in LTP
> that exercises a code path that may have been broken for longer?
>
> > Starting kernel ...
> >
> > [ 0.000000] Booting Linux on physical CPU 0x0
> > [ 0.000000] Linux version 6.3.9-rc1 (tuxmake@tuxmake)
> > (arm-linux-gnueabihf-gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld
> > (GNU Binutils for Debian) 2.35.2) #1 SMP @1687172533
> > [ 0.000000] CPU: ARMv7 Processor [412fc0f2] revision 2 (ARMv7), cr=10c5387d
> > [ 0.000000] CPU: div instructions available: patching division code
> > [ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache
> > [ 0.000000] OF: fdt: Machine model: TI AM5728 BeagleBoard-X15
> >
> > ..
> > LTP fs tests running
> >
> > cd /opt/ltp
> > ./runltp -f fs
> >
> > atch/ltp-lyYeJYjM8Y/fs_di-4743
> > Loops: 10
> > Data File Size: 30
> > fs_di 0 TINFO : Test Started
> > fs_di 0 TINFO : Completed Loop 1
> > fs_di 0 TINFO : Completed Loop 2
> > fs_di 0 TINFO : Completed Loop 3
> > fs_di 0 TINFO : Completed Loop 4
> > fs_di 0 TINFO : Completed Loop 5
> > fs_di 0 TINFO : Completed Loop 6
> > fs_di 0 TINFO : Completed Loop 7
> > fs_di 0 TINFO : Completed Loop 8
> > fs_di 0 TINFO : Completed Loop 9
> > fs_di 0 TINFO : Completed Loop 10
> > fs_di 10 TPASS : Test Successful
> > [ 1212.864074] 8<--- cut here ---
> > [ 1212.867156] Unable to handle kernel NULL pointer dereference at
> > virtual address 00000004 when read
> > [ 1212.876159] [00000004] *pgd=fb342835
> > [ 1212.879760] Internal error: Oops: 17 [#1] SMP ARM
> > [ 1212.884490] Modules linked in: etnaviv gpu_sched
> > snd_soc_simple_card snd_soc_simple_card_utils onboard_usb_hub
> > snd_soc_davinci_mcasp snd_soc_ti_udma snd_soc_ti_edma snd_soc_ti_sdma
> > snd_soc_core ac97_bus snd_pcm_dmaengine snd_pcm cfg80211 snd_timer snd
> > soundcore bluetooth display_connector
> > [ 1212.910217] CPU: 0 PID: 4855 Comm: proc01 Not tainted 6.3.9-rc1 #1
> > [ 1212.916442] Hardware name: Generic DRA74X (Flattened Device Tree)
> > [ 1212.922546] PC is at pci_generic_config_read+0x34/0x8c
> > [ 1212.927734] LR is at pci_generic_config_read+0x1c/0x8c
>
> It looks like the PCIe bus is not set up correctly, I also
> see these messages in the log indicating a problem with it:
>
> [ 3.334503] dra7-pcie 51000000.pcie: host bridge /ocp/target-module@51000000/pcie@51000000 ranges:
> [ 3.343627] dra7-pcie 51000000.pcie: IO 0x0020003000..0x0020012fff -> 0x0000000000
> [ 3.351806] dra7-pcie 51000000.pcie: MEM 0x0020013000..0x002fffffff -> 0x0020013000
> [ 3.362030] dra7-pcie 51000000.pcie: iATU: unroll F, 16 ob, 4 ib, align 4K, limit 4G
> [ 4.370635] dra7-pcie 51000000.pcie: Phy link never came up
> [ 4.376831] dra7-pcie 51000000.pcie: PCI host bridge to bus 0000:00
> [ 4.383148] pci_bus 0000:00: root bus resource [bus 00-ff]
> [ 4.388702] pci_bus 0000:00: root bus resource [io 0x0000-0xffff]
> [ 4.394927] pci_bus 0000:00: root bus resource [mem 0x20013000-0x2fffffff]
> [ 4.401885] pci 0000:00:00.0: [104c:8888] type 01 class 0x060400
> [ 4.407958] pci 0000:00:00.0: reg 0x10: [mem 0x00000000-0x000fffff]
> [ 4.414245] pci 0000:00:00.0: reg 0x14: [mem 0x00000000-0x0000ffff]
> [ 4.420654] pci 0000:00:00.0: supports D1
> [ 4.424682] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
> [ 4.437499] PCI: bus0: Fast back to back transfers disabled
> [ 4.443389] PCI: bus1: Fast back to back transfers enabled
> [ 4.448974] pci 0000:00:00.0: BAR 0: assigned [mem 0x20100000-0x201fffff]
> [ 4.455810] pci 0000:00:00.0: BAR 1: assigned [mem 0x20020000-0x2002ffff]
> [ 4.462646] pci 0000:00:00.0: PCI bridge to [bus 01-ff]
> [ 4.468322] pcieport 0000:00:00.0: PME: Signaling with IRQ 135
> [ 4.474487] genirq: Threaded irq requested with handler=NULL and !ONESHOT for dra7xx-pcie-main (irq 132)
> [ 4.484100] dra7-pcie 51000000.pcie: failed to request irq
> [ 4.489685] dra7-pcie: probe of 51000000.pcie failed with error -22
> [ 4.503967] pcie-clkctrl:0000:0: failed to disable
>
> The function that crashed is
>
> int pci_generic_config_read(struct pci_bus *bus, unsigned int devfn,
>                             int where, int size, u32 *val)
> {
>         void __iomem *addr;
>
>         addr = bus->ops->map_bus(bus, devfn, where);
>         if (!addr)
>                 return PCIBIOS_DEVICE_NOT_FOUND;
>
>         if (size == 1)
>                 *val = readb(addr);
>         else if (size == 2)
>                 *val = readw(addr);
>         else
>                 *val = readl(addr);
>
>         return PCIBIOS_SUCCESSFUL;
> }
>
> I have not disassembled the vmlinux file, but I can see that the
> offset into the NULL pointer is '4', which does not match the
> structure offsets for bus->ops or ops->map_bus.
>
> I also see that if map_bus returns NULL, we treat that as
> an error, but if it returns '4', that is taken as a pointer,
> which is my best guess at what is happening here.
>
> map_bus() seems to be either dw_pcie_other_conf_map_bus() or
> dw_pcie_own_conf_map_bus(), since the dra7 does not have its
> own variant but inherits these from the dwc pci driver.
>
> I think this is caused by the combination of two bugs:
>
> - something prevents the dra7-pcie driver from probing the
>   device correctly, ultimately failing with the "failed to
>   request irq" message.
>
> - The error handling in dra7xx_pcie_probe() fails to clean
>   up after the first problem, leaving the PCIe host
>   in a broken state instead of removing it entirely.

The reported kernel crash is continuously happening on the BeagleBoard x15
device while running LTP fs tests on stable rc 6.4.7-rc1.

fs_di 10 TPASS : Test Successful
[ 1195.556701] 8<--- cut here ---
[ 1195.559783] Unable to handle kernel NULL pointer dereference at virtual address 00000004 when read
[ 1195.568786] [00000004] *pgd=00000000
[ 1195.572387] Internal error: Oops: 5 [#1] SMP ARM
[ 1195.577026] Modules linked in: etnaviv gpu_sched snd_soc_simple_card snd_soc_simple_card_utils onboard_usb_hub snd_soc_davinci_mcasp snd_soc_ti_udma snd_soc_ti_edma snd_soc_ti_sdma snd_soc_core ac97_bus snd_pcm_dmaengine snd_pcm snd_timer snd soundcore display_connector
[ 1195.601104] CPU: 0 PID: 4876 Comm: proc01 Not tainted 6.4.7-rc1 #1
[ 1195.607330] Hardware name: Generic DRA74X (Flattened Device Tree)
[ 1195.613464] PC is at pci_generic_config_read+0x34/0x8c
[ 1195.618621] LR is at pci_generic_config_read+0x1c/0x8c

Links,
- https://lkft.validation.linaro.org/scheduler/job/6619189#L3236
- https://storage.tuxsuite.com/public/linaro/lkft/builds/2T3uHpNM7MkE9BOTcs22aOVCDnw/

- Naresh

>
> Arnd
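
For anyone following the map_bus() guess quoted above, here is a minimal
userspace sketch of why a returned value of 4 would get past the NULL check
in pci_generic_config_read(). It is not kernel code: sketch_map_bus() and
sketch_addr_would_be_dereferenced() are made-up names, and the idea that the
address is computed as a zero base plus offset 4 is only an assumption for
illustration. The thread does not establish where the 4 actually comes from.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace illustration only. Both functions are hypothetical stand-ins,
 * not kernel APIs. */

/* Stand-in for a map_bus() callback that returns "register base + offset".
 * A base of 0 is an ASSUMPTION about what a half-torn-down host bridge
 * might hand back; it is not confirmed anywhere in this thread. */
static uintptr_t sketch_map_bus(uintptr_t base, unsigned int where)
{
        return base + where;
}

/* Same shape as the `if (!addr)` test in pci_generic_config_read():
 * only an address of exactly 0 is rejected, so any other value,
 * however small, would go on to readb()/readw()/readl(). */
static bool sketch_addr_would_be_dereferenced(uintptr_t addr)
{
        return addr != 0;
}
```

With a zero base and where == 4, sketch_map_bus() yields address 4, and
sketch_addr_would_be_dereferenced() reports that the NULL check would let it
through, which would be consistent with a fault at virtual address 00000004.
With where == 0 the same check rejects the address.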