Re: x15: Unable to handle kernel NULL pointer dereference at virtual address 00000004 when read : pci_generic_config_read

Naresh Kamboju <naresh.kamboju@xxxxxxxxxx> · Wed, 26 Jul 2023 15:29:31 +0530

On Tue, 20 Jun 2023 at 14:10, Arnd Bergmann <arnd@xxxxxxxx> wrote:
>
> On Tue, Jun 20, 2023, at 10:00, Naresh Kamboju wrote:
> > We have been noticing the following kernel crash on x15 device while running
> > LTP fs proc01 testing with Linux stable rc 6.x kernels.
>
> Do you know if this is a regression with this kernel version compared
> to older kernels running the same tests, or an added testcase in LTP
> that exercises a code path that may have been broken for longer?
>
> > Starting kernel ...
> >
> > [    0.000000] Booting Linux on physical CPU 0x0
> > [    0.000000] Linux version 6.3.9-rc1 (tuxmake@tuxmake)
> > (arm-linux-gnueabihf-gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld
> > (GNU Binutils for Debian) 2.35.2) #1 SMP @1687172533
> > [    0.000000] CPU: ARMv7 Processor [412fc0f2] revision 2 (ARMv7), cr=10c5387d
> > [    0.000000] CPU: div instructions available: patching division code
> > [    0.000000] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache
> > [    0.000000] OF: fdt: Machine model: TI AM5728 BeagleBoard-X15
> >
> > ..
> > LTP fs tests running
> >
> > cd /opt/ltp
> > ./runltp -f fs
> >
> > atch/ltp-lyYeJYjM8Y/fs_di-4743
> >              Loops: 10
> >     Data File Size: 30
> > fs_di       0  TINFO  :  Test Started
> > fs_di       0  TINFO  :  Completed Loop 1
> > fs_di       0  TINFO  :  Completed Loop 2
> > fs_di       0  TINFO  :  Completed Loop 3
> > fs_di       0  TINFO  :  Completed Loop 4
> > fs_di       0  TINFO  :  Completed Loop 5
> > fs_di       0  TINFO  :  Completed Loop 6
> > fs_di       0  TINFO  :  Completed Loop 7
> > fs_di       0  TINFO  :  Completed Loop 8
> > fs_di       0  TINFO  :  Completed Loop 9
> > fs_di       0  TINFO  :  Completed Loop 10
> > fs_di      10  TPASS  :  Test Successful
> > [ 1212.864074] 8<--- cut here ---
> > [ 1212.867156] Unable to handle kernel NULL pointer dereference at
> > virtual address 00000004 when read
> > [ 1212.876159] [00000004] *pgd=fb342835
> > [ 1212.879760] Internal error: Oops: 17 [#1] SMP ARM
> > [ 1212.884490] Modules linked in: etnaviv gpu_sched
> > snd_soc_simple_card snd_soc_simple_card_utils onboard_usb_hub
> > snd_soc_davinci_mcasp snd_soc_ti_udma snd_soc_ti_edma snd_soc_ti_sdma
> > snd_soc_core ac97_bus snd_pcm_dmaengine snd_pcm cfg80211 snd_timer snd
> > soundcore bluetooth display_connector
> > [ 1212.910217] CPU: 0 PID: 4855 Comm: proc01 Not tainted 6.3.9-rc1 #1
> > [ 1212.916442] Hardware name: Generic DRA74X (Flattened Device Tree)
> > [ 1212.922546] PC is at pci_generic_config_read+0x34/0x8c
> > [ 1212.927734] LR is at pci_generic_config_read+0x1c/0x8c
>
> It looks like the PCIe bus is not set up correctly, I also
> see these messages in the log indicating a problem with it:
>
> [    3.334503] dra7-pcie 51000000.pcie: host bridge /ocp/target-module@51000000/pcie@51000000 ranges:
> [    3.343627] dra7-pcie 51000000.pcie:       IO 0x0020003000..0x0020012fff -> 0x0000000000
> [    3.351806] dra7-pcie 51000000.pcie:      MEM 0x0020013000..0x002fffffff -> 0x0020013000
> [    3.362030] dra7-pcie 51000000.pcie: iATU: unroll F, 16 ob, 4 ib, align 4K, limit 4G
> [    4.370635] dra7-pcie 51000000.pcie: Phy link never came up
> [    4.376831] dra7-pcie 51000000.pcie: PCI host bridge to bus 0000:00
> [    4.383148] pci_bus 0000:00: root bus resource [bus 00-ff]
> [    4.388702] pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
> [    4.394927] pci_bus 0000:00: root bus resource [mem 0x20013000-0x2fffffff]
> [    4.401885] pci 0000:00:00.0: [104c:8888] type 01 class 0x060400
> [    4.407958] pci 0000:00:00.0: reg 0x10: [mem 0x00000000-0x000fffff]
> [    4.414245] pci 0000:00:00.0: reg 0x14: [mem 0x00000000-0x0000ffff]
> [    4.420654] pci 0000:00:00.0: supports D1
> [    4.424682] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
> [    4.437499] PCI: bus0: Fast back to back transfers disabled
> [    4.443389] PCI: bus1: Fast back to back transfers enabled
> [    4.448974] pci 0000:00:00.0: BAR 0: assigned [mem 0x20100000-0x201fffff]
> [    4.455810] pci 0000:00:00.0: BAR 1: assigned [mem 0x20020000-0x2002ffff]
> [    4.462646] pci 0000:00:00.0: PCI bridge to [bus 01-ff]
> [    4.468322] pcieport 0000:00:00.0: PME: Signaling with IRQ 135
> [    4.474487] genirq: Threaded irq requested with handler=NULL and !ONESHOT for dra7xx-pcie-main (irq 132)
> [    4.484100] dra7-pcie 51000000.pcie: failed to request irq
> [    4.489685] dra7-pcie: probe of 51000000.pcie failed with error -22
> [    4.503967] pcie-clkctrl:0000:0: failed to disable
>
> The function that crashed is
>
> int pci_generic_config_read(struct pci_bus *bus, unsigned int devfn,
>                             int where, int size, u32 *val)
> {
>         void __iomem *addr;
>
>         addr = bus->ops->map_bus(bus, devfn, where);
>         if (!addr)
>                 return PCIBIOS_DEVICE_NOT_FOUND;
>
>         if (size == 1)
>                 *val = readb(addr);
>         else if (size == 2)
>                 *val = readw(addr);
>         else
>                 *val = readl(addr);
>
>         return PCIBIOS_SUCCESSFUL;
> }
>
> I have not disassembled the vmlinux file, but I can see that the
> offset into the NULL pointer is '4', which does not match the
> structur offsets for bus->ops or ops->map_bus.
>
> I also see that if map_bus returns NULL, we treat that as
> an error, but if it returns '4', that is taken as a pointer,
> which is my best guess at what is happening here.
>
> map_bus() seems to be either dw_pcie_other_conf_map_bus() or
> dw_pcie_own_conf_map_bus(), since the dra7 does not have its
> own variant but inherits these from the dwc pci driver.
>
> I think this is caused by the combination of two bugs:
>
> - something prevents the dra7-pcie driver from probing the
>   device correctly, ultimately failing with the "failed to
>   request irq" message.
>
> - The error handling in dra7xx_pcie_probe() fails to clean
>   up after the first problem, leaving the PCIe host
>   in a broken state instead of removing it entirely.

The reported kernel crash is continuously happening on the
BeagleBoard x15 device while running LTP fs tests on stable rc 6.4.7-rc1.

fs_di      10  TPASS  :  Test Successful
[ 1195.556701] 8<--- cut here ---
[ 1195.559783] Unable to handle kernel NULL pointer dereference at
virtual address 00000004 when read
[ 1195.568786] [00000004] *pgd=00000000
[ 1195.572387] Internal error: Oops: 5 [#1] SMP ARM
[ 1195.577026] Modules linked in: etnaviv gpu_sched
snd_soc_simple_card snd_soc_simple_card_utils onboard_usb_hub
snd_soc_davinci_mcasp snd_soc_ti_udma snd_soc_ti_edma snd_soc_ti_sdma
snd_soc_core ac97_bus snd_pcm_dmaengine snd_pcm snd_timer snd
soundcore display_connector
[ 1195.601104] CPU: 0 PID: 4876 Comm: proc01 Not tainted 6.4.7-rc1 #1
[ 1195.607330] Hardware name: Generic DRA74X (Flattened Device Tree)
[ 1195.613464] PC is at pci_generic_config_read+0x34/0x8c
[ 1195.618621] LR is at pci_generic_config_read+0x1c/0x8c

Links,
 - https://lkft.validation.linaro.org/scheduler/job/6619189#L3236
 - https://storage.tuxsuite.com/public/linaro/lkft/builds/2T3uHpNM7MkE9BOTcs22aOVCDnw/

- Naresh

>
>        Arnd