Re: xhci_pci & PCIe hotplug crash

Pali Rohár <pali@xxxxxxxxxx> · Wed, 5 May 2021 15:02:40 +0200

On Wednesday 05 May 2021 14:44:02 Lukas Wunner wrote:
> On Wed, May 05, 2021 at 02:33:46PM +0200, Pali Rohár wrote:
> > I just spotted this crash while debugging the PCIe controller driver
> > pci-aardvark.c, trying to expose its link down events via the "hot plug"
> > interrupt and the corresponding link layer state flags.
> > 
> > And because in the whole call trace I see only generic PCIe and USB code
> > paths without any driver-specific parts, I suspect that this is not a
> > PCIe controller-specific issue but rather something "wrong" in generic
> > PCIe (or USB) code. That is why I sent this email, so maybe somebody
> > else will find something suspicious here.
> > 
> > But there is still a chance that the issue is in the pci-aardvark.c
> > driver, which somehow masked it and propagated it into the generic
> > PCIe hot plug code path.
> 
> If you hot-remove the XHCI controller, accesses to its MMIO space
> will fail.  xhci_irq() seems to perform such MMIO accesses.

The abort happens at offset 4d00; here is the relevant part of the objdump
output:

        if (!arch_irqs_disabled_flags(flags))
    4ccc:       340014a0        cbz     w0, 4f60 <xhci_irq+0x2d0>
    4cd0:       d2800000        mov     x0, #0x0                        // #0
    4cd4:       910a7276        add     x22, x19, #0x29c
    4cd8:       52800022        mov     w2, #0x1                        // #1
    4cdc:       f98002d1        prfm    pstl1strm, [x22]
    4ce0:       885ffec1        ldaxr   w1, [x22]
    4ce4:       4a000023        eor     w3, w1, w0
    4ce8:       35000063        cbnz    w3, 4cf4 <xhci_irq+0x64>
    4cec:       88037ec2        stxr    w3, w2, [x22]
    4cf0:       35ffff83        cbnz    w3, 4ce0 <xhci_irq+0x50>
    4cf4:       35002741        cbnz    w1, 51dc <xhci_irq+0x54c>
        status = readl(&xhci->op_regs->status);
    4cf8:       f9400f41        ldr     x1, [x26, #24]
    4cfc:       91001021        add     x1, x1, #0x4
    4d00:       b9400021        ldr     w1, [x1]

So it looks like it is that MMIO access, right?
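
For reference, the offset inside xhci_irq() can be recovered from the
branch targets in the listing above: 4f60 is annotated as xhci_irq+0x2d0,
so the function should start at 0x4c90 and the faulting load at 4d00 would
be xhci_irq+0x70. A small sketch of that arithmetic follows; the helper
names are made up for illustration and are not part of any kernel tool.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: recover a function's start address from one
 * annotated branch target such as "4f60 <xhci_irq+0x2d0>". */
static uint64_t func_start(uint64_t target_addr, uint64_t target_off)
{
	assert(target_addr >= target_off);
	return target_addr - target_off;
}

/* Hypothetical helper: offset of an instruction inside that function. */
static uint64_t insn_offset(uint64_t insn_addr, uint64_t start)
{
	assert(insn_addr >= start);
	return insn_addr - start;
}
```

With the values from the listing, insn_offset(0x4d00, func_start(0x4f60,
0x2d0)) gives 0x70, which can be compared against the PC reported in the
abort trace.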

> Normally this should happen silently and MMIO accesses just return
> with a fabricated "all ones" response.  Chances are however that the
> Aardvark controller raises a synchronous external abort instead.

This makes sense. Good catch also on the fact that it comes from threaded
context!
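
To illustrate the "all ones" case described above: if the read completes
with 0xffffffff instead of aborting, an interrupt handler can treat that
value as "device gone" and bail out. Below is a minimal userspace sketch of
that pattern. struct fake_hc, fake_readl() and fake_irq() are invented for
this example and are not the real xhci code; it also assumes that all ones
is never a valid value of the status register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a host controller; not struct xhci_hcd. */
struct fake_hc {
	bool present;    /* false once the device has been hot-removed */
	uint32_t status; /* status register value while the device is present */
	bool dead;       /* set when the handler decides the device is gone */
};

/* Models an MMIO read that completes with all ones after hot-removal. */
static uint32_t fake_readl(const struct fake_hc *hc)
{
	assert(hc != NULL);
	return hc->present ? hc->status : UINT32_MAX;
}

/* Returns true if the interrupt was handled, false if the device vanished. */
static bool fake_irq(struct fake_hc *hc)
{
	uint32_t status = fake_readl(hc);

	if (status == UINT32_MAX) {
		/* Assumption: all ones is never a valid status value. */
		hc->dead = true;
		return false;
	}
	return true;
}
```

This only works when the read itself completes. With a synchronous
external abort the handler never reaches the comparison.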

> Perhaps you can teach it not to do that.

No :-( I read all the documentation that is available for this PCIe
controller, which is part of the Marvell A3720 SoC, and I have not found
anything that allows me to configure the raising of external aborts.

I already figured out that the CPU also receives an external abort when
trying to issue a new PIO transfer for accessing PCI config space while
the previous transfer has not finished yet. And there is also no way (at
least in the documentation) to "mask" this external abort. But this issue
can be fixed in the pci-aardvark.c driver by disallowing access to config
space while the previous transfer is still running (I will send a patch
for this one).
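
One way such a guard could look, as a single-threaded userspace sketch: a
busy flag is set when a transfer starts and cleared when it completes, and
a new transfer is refused while the flag is set. struct fake_pio and both
functions are hypothetical and are not taken from pci-aardvark.c; a real
driver would also need proper locking and would read the transfer state
from the hardware.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical PIO engine state; not taken from pci-aardvark.c. */
struct fake_pio {
	bool busy; /* true while a config space transfer is in flight */
};

/* Start a transfer, or return -EBUSY if the previous one is still running. */
static int fake_pio_start(struct fake_pio *pio)
{
	assert(pio != NULL);
	if (pio->busy)
		return -EBUSY;
	pio->busy = true;
	return 0;
}

/* Mark the in-flight transfer as finished. */
static void fake_pio_complete(struct fake_pio *pio)
{
	assert(pio != NULL);
	pio->busy = false;
}
```

A caller that gets -EBUSY would wait or retry instead of touching the PIO
registers, so the second transfer is never issued on top of the first.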

So it seems that the PCIe controller HW triggers these external aborts
when a device on the PCIe bus is no longer accessible.

If this issue is really caused by an MMIO access from the xhci driver
when the device is no longer accessible on the bus, can we do something
to prevent this kernel crash? Can the external abort somehow be masked in
the kernel for the duration of the MMIO access?

> Thanks,
> 
> Lukas