Re: [BUG] PCI: rockchip: rk3399: pcie switch support

Robin Murphy <robin.murphy@xxxxxxx> · Tue, 14 Apr 2020 13:28:05 +0100

On 2020-04-14 12:35 pm, Soeren Moch wrote:
> On 06.04.20 19:12, Soeren Moch wrote:
>> On 06.04.20 14:52, Robin Murphy wrote:
>>> On 2020-04-04 7:41 pm, Soeren Moch wrote:
>>>> I want to use a PCIe switch on a RK3399 based RockPro64 V2.1 board.
>>>> "Normal" PCIe cards work (mostly) just fine on this board. The PCIe
>>>> switches (I tried Pericom and ASMedia based switches) also work fine on
>>>> other boards. The RK3399 PCIe controller with pcie_rockchip_host driver
>>>> also recognises the switch, but fails to initialize the buses behind the
>>>> bridge properly, see syslog from linux-5.6.0.
>>>>
>>>> Any ideas what I do wrong, or any suggestions what I can test here?
>>> See the thread here:
>>>
>>> https://lore.kernel.org/linux-pci/CAMdYzYoTwjKz4EN8PtD5pZfu3+SX+68JL+dfvmCrSnLL=K6Few@xxxxxxxxxxxxxx/
>>
>> Thanks Robin!
>>
>> I also found out in the meantime that device enumeration fails in this
>> fatal way when probing non-existent devices. So if I hack my complete
>> bus topology into rockchip_pcie_valid_device, then all existing devices
>> come up properly. Of course this is not how PCIe should work.
>>> The conclusion there seems to be that the RK3399 root complex just
>>> doesn't handle certain types of response in a sensible manner, and
>>> there's not much that can reasonably be done to change that.
>> Hm, at least there is the promising suggestion to take over the SError
>> handler, maybe in ATF, as workaround.
> Unfortunately it seems to be not that easy. Only when PCIe device
> probing runs on one of the Cortex-A72 cores of rk3399 we see the SError.
> When probing runs on one of the A53 cores, we get a synchronous external
> abort instead.
>
> Is this expected to see different error types on big.LITTLE systems? Or
> is this another special property of the rk3399 pcie controller?

As far as I'm aware, the CPU microarchitecture is indeed one of the 
factors in whether it takes a given external abort synchronously or 
asynchronously, so yes, I'd say that probably is expected. I wouldn't 
necessarily even rely on a single microarchitecture only behaving one 
way, since in principle it's possible that surrounding instructions 
might affect whether the core still has enough context left to take the 
exception synchronously or not at the point the abort does come back.
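
To make the two cases concrete, here is a rough, standalone sketch of how
the syndrome value reported with the exception could be used to tell a
synchronous external abort from an SError. The field positions and
encodings are taken from the Armv8-A ESR_ELx definition as I understand
it; classify_esr() and the enum are hypothetical names for illustration
only, not an existing kernel interface, and this is not a proposed fix.

```c
#include <stdint.h>

/* ESR_ELx fields, as I read the Armv8-A architecture definition. */
#define ESR_EC_SHIFT     26
#define ESR_EC_MASK      0x3fu
#define ESR_EC_DABT_LOW  0x24u  /* data abort taken from a lower EL */
#define ESR_EC_DABT_CUR  0x25u  /* data abort taken from the same EL */
#define ESR_EC_SERROR    0x2fu  /* SError interrupt (asynchronous) */
#define ESR_DFSC_MASK    0x3fu
#define ESR_DFSC_EXTABT  0x10u  /* sync external abort, not on a table walk */

/* Hypothetical classification, for illustration only. */
enum abort_kind { ABORT_OTHER, ABORT_SYNC_EXTERNAL, ABORT_SERROR };

static enum abort_kind classify_esr(uint32_t esr)
{
	uint32_t ec = (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;

	/* Asynchronous case: the abort arrived after the access retired. */
	if (ec == ESR_EC_SERROR)
		return ABORT_SERROR;

	/* Synchronous case: a data abort whose fault status says "external". */
	if ((ec == ESR_EC_DABT_LOW || ec == ESR_EC_DABT_CUR) &&
	    (esr & ESR_DFSC_MASK) == ESR_DFSC_EXTABT)
		return ABORT_SYNC_EXTERNAL;

	return ABORT_OTHER;
}
```

With made-up example values (not taken from any log in this thread),
classify_esr(0x96000010) comes out as ABORT_SYNC_EXTERNAL and
classify_esr(0xbf000002) as ABORT_SERROR. Only the synchronous case can be
tied to the faulting instruction, which is why any workaround would have to
cope with both.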

In general, external aborts are a "should never happen" kind of thing, so 
they're not necessarily expected to be recoverable (I think the RAS 
extensions might add more robustness in terms of reporting, but they 
aren't relevant here either way).

At this point I'm starting to wonder whether it might be possible to do 
something similar to the Arm N1SDP workaround using the Cortex-M0, 
albeit with the complication that probing would realistically have to be 
explicitly invoked from the Linux driver due to clocks and external 
regulators... :/

Robin.