Re: xhci_pci & PCIe hotplug crash

Pali Rohár <pali@xxxxxxxxxx> · Wed, 5 May 2021 17:39:42 +0200

On Wednesday 05 May 2021 15:20:11 David Laight wrote:
> From: Pali Rohár
> > Sent: 05 May 2021 14:03
> ...
> > I already figured out that the CPU also receives an external abort
> > when trying to issue a new PIO transfer for accessing PCI config space
> > while the previous transfer has not finished yet. And there is also no
> > way (at least in the documentation) to "mask" this external abort. But
> > this issue can be fixed in the pci-aardvark.c driver by disallowing
> > access to config space while the previous transfer is still running
> > (I will send a patch for this one).
> 
> By the sound of the above you need to put a global spinlock around
> all PCIe config space accesses.

The kernel already uses raw_spin_lock_irqsave(); see the
pci_lock_config() macro in pci/access.c, which implements this global
lock for config space access.

But the issue is that pci-driver.c does not wait for the transfer to
finish before returning from the function, after which this spin lock
is unlocked...

A week ago I fixed this issue in U-Boot, and a similar fix would also be
needed for the kernel: https://source.denx.de/u-boot/u-boot/-/commit/eccbd4ad8e4e
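
A minimal user-space sketch of that idea in C follows. It is not the
U-Boot commit or the pci-aardvark.c code: struct pio_ctrl, pio_poll()
and all other names are made up for illustration, the hardware is
modelled as a simple countdown, and the caller is assumed to already
hold the global config space lock. The point is only that a new PIO
transfer is never started while the previous one may still be running,
even if the previous call gave up with a timeout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a PIO config engine; not a real register layout. */
struct pio_ctrl {
	bool busy;           /* a transfer is still running in "hardware" */
	unsigned ticks_left; /* status polls until that transfer completes */
	uint32_t data;       /* value returned by a completed read */
};

enum pio_status { PIO_OK = 0, PIO_TIMEOUT = -1 };

/* One status poll; in this model each poll advances the transfer. */
static void pio_poll(struct pio_ctrl *c)
{
	if (!c->busy)
		return;
	if (c->ticks_left > 0)
		c->ticks_left--;
	if (c->ticks_left == 0)
		c->busy = false;
}

/* Poll up to max_polls times; return true once the engine is idle. */
static bool pio_wait_idle(struct pio_ctrl *c, unsigned max_polls)
{
	for (unsigned i = 0; i < max_polls; i++) {
		if (!c->busy)
			return true;
		pio_poll(c);
	}
	return !c->busy;
}

/* Start a transfer that needs `duration` polls (at least one) to finish. */
static void pio_start(struct pio_ctrl *c, unsigned duration)
{
	c->busy = true;
	c->ticks_left = duration ? duration : 1;
}

/* Config read that refuses to start while an older transfer is running. */
static int pio_config_read(struct pio_ctrl *c, unsigned duration,
			   unsigned max_polls, uint32_t *val)
{
	/* An earlier call may have timed out with its transfer still active. */
	if (!pio_wait_idle(c, max_polls)) {
		*val = 0xffffffff;
		return PIO_TIMEOUT;
	}

	pio_start(c, duration);

	/* Wait for our own transfer before reading the result. */
	if (!pio_wait_idle(c, max_polls)) {
		*val = 0xffffffff;
		return PIO_TIMEOUT;
	}

	*val = c->data;
	return PIO_OK;
}
```

In this model a read that times out leaves the engine busy, and the
next read first drains that old transfer instead of issuing a new one
on top of it.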

But this issue is not related to my original report about XHCI & PCI.

> Is this the horrid hardware that can't do a 'normal' PCIe transfer
> while a config space access is in progress?

The issue is different. You cannot start a config space PIO transfer
while another config space PIO transfer is in progress.

> If that is true then you have bigger problems.
> Especially if it is an SMP system.

I really hope that a memory read or write transfer can be initiated
while a config transfer is in progress. The Marvell A3720 platform, on
which this pci aardvark controller is found, is a 2-core CPU SoC.

At least I have not seen any abort when the PCIe link is up, a card is
connected and the previous config access transfer has finished.

> > So it seems that the PCIe controller HW triggers these external
> > aborts when the device on the PCIe bus is not accessible anymore.
> > 
> > If this issue is really caused by MMIO access from the xhci driver
> > when the device is not accessible on the bus anymore, can we do
> > something to prevent this kernel crash? Somehow mask that external
> > abort in the kernel for the duration of the MMIO access?
> 
> If it is a cycle abort then the interrupted address is probably
> that of the MMIO instruction.
> So you need to catch the abort, emulate the instruction and
> then return to the next one.

Does the kernel have an API & infrastructure for catching these aborts
and executing a driver's own handler when an abort happens?

> This probably requires an exception table containing the address
> of every readb/w/l() instruction.
> 
> If you get a similar error on writes it is likely to be a few
> instructions after the actual writeb/w/l() instruction.
> Writes are normally 'posted' and asynchronous.
> 
> If you are really lucky you can get enough state out of the
> abort handler to fixup/ignore the cycle without an
> exception table.
> 
> 	David
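
For reference, the catch / emulate / skip approach with an exception
table could be pictured roughly as below. This is only a user-space
sketch in C under several assumptions, not existing kernel code:
struct fake_regs, struct mmio_extable_entry and both functions are
hypothetical names, instructions are assumed to be a fixed 4 bytes long
(as on arm64), the abort is assumed to be reported precisely at the
load instruction, and the failed read is emulated as all-ones, which is
the usual result of reading from a missing PCI device.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPU state saved on abort entry. */
struct fake_regs {
	uint64_t pc;    /* address of the faulting instruction */
	uint64_t x[31]; /* general purpose registers */
};

/* One entry per MMIO load instruction that is allowed to fault. */
struct mmio_extable_entry {
	uint64_t insn; /* address of the readb/w/l() load instruction */
	unsigned rt;   /* index of the destination register of that load */
};

/* Binary search; the table must be sorted by ascending insn address. */
static const struct mmio_extable_entry *
mmio_extable_search(const struct mmio_extable_entry *tbl, size_t n,
		    uint64_t pc)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (tbl[mid].insn == pc)
			return &tbl[mid];
		if (tbl[mid].insn < pc)
			lo = mid + 1;
		else
			hi = mid;
	}
	return NULL;
}

/*
 * Called from the (hypothetical) abort handler. Returns true if the
 * abort was caused by a known MMIO load and has been fixed up, false
 * if it must be treated as a real, fatal abort.
 */
static bool mmio_abort_fixup(struct fake_regs *regs,
			     const struct mmio_extable_entry *tbl, size_t n)
{
	const struct mmio_extable_entry *e;

	e = mmio_extable_search(tbl, n, regs->pc);
	if (!e || e->rt >= 31)
		return false;

	regs->x[e->rt] = 0xffffffffu; /* emulate the read: all-ones */
	regs->pc += 4;                /* resume at the next instruction */
	return true;
}
```

Writes would not fit this scheme directly, because a posted write may
fault a few instructions after the store itself, so the reported
address would not match any table entry.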