Re: x15: Unable to handle kernel NULL pointer dereference at virtual address 00000004 when read : pci_generic_config_read

"Arnd Bergmann" <arnd@xxxxxxxx> · Tue, 20 Jun 2023 10:40:30 +0200

On Tue, Jun 20, 2023, at 10:00, Naresh Kamboju wrote:
> We have been noticing the following kernel crash on x15 device while running
> LTP fs proc01 testing with Linux stable rc 6.x kernels.

Do you know if this is a regression with this kernel version compared
to older kernels running the same tests, or an added testcase in LTP
that exercises a code path that may have been broken for longer?

> Starting kernel ...
>
> [    0.000000] Booting Linux on physical CPU 0x0
> [    0.000000] Linux version 6.3.9-rc1 (tuxmake@tuxmake)
> (arm-linux-gnueabihf-gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld
> (GNU Binutils for Debian) 2.35.2) #1 SMP @1687172533
> [    0.000000] CPU: ARMv7 Processor [412fc0f2] revision 2 (ARMv7), cr=10c5387d
> [    0.000000] CPU: div instructions available: patching division code
> [    0.000000] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache
> [    0.000000] OF: fdt: Machine model: TI AM5728 BeagleBoard-X15
>
> ..
> LTP fs tests running
>
> cd /opt/ltp
> ./runltp -f fs
>
> atch/ltp-lyYeJYjM8Y/fs_di-4743
>              Loops: 10
>     Data File Size: 30
> fs_di       0  TINFO  :  Test Started
> fs_di       0  TINFO  :  Completed Loop 1
> fs_di       0  TINFO  :  Completed Loop 2
> fs_di       0  TINFO  :  Completed Loop 3
> fs_di       0  TINFO  :  Completed Loop 4
> fs_di       0  TINFO  :  Completed Loop 5
> fs_di       0  TINFO  :  Completed Loop 6
> fs_di       0  TINFO  :  Completed Loop 7
> fs_di       0  TINFO  :  Completed Loop 8
> fs_di       0  TINFO  :  Completed Loop 9
> fs_di       0  TINFO  :  Completed Loop 10
> fs_di      10  TPASS  :  Test Successful
> [ 1212.864074] 8<--- cut here ---
> [ 1212.867156] Unable to handle kernel NULL pointer dereference at
> virtual address 00000004 when read
> [ 1212.876159] [00000004] *pgd=fb342835
> [ 1212.879760] Internal error: Oops: 17 [#1] SMP ARM
> [ 1212.884490] Modules linked in: etnaviv gpu_sched
> snd_soc_simple_card snd_soc_simple_card_utils onboard_usb_hub
> snd_soc_davinci_mcasp snd_soc_ti_udma snd_soc_ti_edma snd_soc_ti_sdma
> snd_soc_core ac97_bus snd_pcm_dmaengine snd_pcm cfg80211 snd_timer snd
> soundcore bluetooth display_connector
> [ 1212.910217] CPU: 0 PID: 4855 Comm: proc01 Not tainted 6.3.9-rc1 #1
> [ 1212.916442] Hardware name: Generic DRA74X (Flattened Device Tree)
> [ 1212.922546] PC is at pci_generic_config_read+0x34/0x8c
> [ 1212.927734] LR is at pci_generic_config_read+0x1c/0x8c

It looks like the PCIe bus is not set up correctly. I also
see these messages in the log indicating a problem with it:

[    3.334503] dra7-pcie 51000000.pcie: host bridge /ocp/target-module@51000000/pcie@51000000 ranges:
[    3.343627] dra7-pcie 51000000.pcie:       IO 0x0020003000..0x0020012fff -> 0x0000000000
[    3.351806] dra7-pcie 51000000.pcie:      MEM 0x0020013000..0x002fffffff -> 0x0020013000
[    3.362030] dra7-pcie 51000000.pcie: iATU: unroll F, 16 ob, 4 ib, align 4K, limit 4G
[    4.370635] dra7-pcie 51000000.pcie: Phy link never came up
[    4.376831] dra7-pcie 51000000.pcie: PCI host bridge to bus 0000:00
[    4.383148] pci_bus 0000:00: root bus resource [bus 00-ff]
[    4.388702] pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
[    4.394927] pci_bus 0000:00: root bus resource [mem 0x20013000-0x2fffffff]
[    4.401885] pci 0000:00:00.0: [104c:8888] type 01 class 0x060400
[    4.407958] pci 0000:00:00.0: reg 0x10: [mem 0x00000000-0x000fffff]
[    4.414245] pci 0000:00:00.0: reg 0x14: [mem 0x00000000-0x0000ffff]
[    4.420654] pci 0000:00:00.0: supports D1
[    4.424682] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
[    4.437499] PCI: bus0: Fast back to back transfers disabled
[    4.443389] PCI: bus1: Fast back to back transfers enabled
[    4.448974] pci 0000:00:00.0: BAR 0: assigned [mem 0x20100000-0x201fffff]
[    4.455810] pci 0000:00:00.0: BAR 1: assigned [mem 0x20020000-0x2002ffff]
[    4.462646] pci 0000:00:00.0: PCI bridge to [bus 01-ff]
[    4.468322] pcieport 0000:00:00.0: PME: Signaling with IRQ 135
[    4.474487] genirq: Threaded irq requested with handler=NULL and !ONESHOT for dra7xx-pcie-main (irq 132)
[    4.484100] dra7-pcie 51000000.pcie: failed to request irq
[    4.489685] dra7-pcie: probe of 51000000.pcie failed with error -22
[    4.503967] pcie-clkctrl:0000:0: failed to disable

The function that crashed is:

int pci_generic_config_read(struct pci_bus *bus, unsigned int devfn,
                            int where, int size, u32 *val)
{
        void __iomem *addr;

        addr = bus->ops->map_bus(bus, devfn, where);
        if (!addr)
                return PCIBIOS_DEVICE_NOT_FOUND;

        if (size == 1)
                *val = readb(addr);
        else if (size == 2)
                *val = readw(addr);
        else
                *val = readl(addr);

        return PCIBIOS_SUCCESSFUL;
}

I have not disassembled the vmlinux file, but I can see that the
faulting address is offset '4' from a NULL pointer, which does not
match the structure offsets for bus->ops or ops->map_bus.

I also see that if map_bus returns NULL, we treat that as
an error, but if it returns '4', that is taken as a valid pointer
and dereferenced, which is my best guess at what is happening here.

map_bus() seems to be either dw_pcie_other_conf_map_bus() or
dw_pcie_own_conf_map_bus(), since the dra7 does not have its
own variant but inherits these from the dwc pci driver.

I think this is caused by the combination of two bugs:

- Something prevents the dra7-pcie driver from probing the
  device correctly, ultimately failing with the "failed to
  request irq" message.

- The error handling in dra7xx_pcie_probe() fails to clean
  up after the first problem, leaving the PCIe host
  in a broken state instead of removing it entirely.

       Arnd