Re: [PATCH v2 0/9] PCI: rockchip: Fix RK3399 PCIe endpoint controller driver

Damien Le Moal <damien.lemoal@xxxxxxxxxxxxxxxxxx> · Fri, 17 Mar 2023 07:09:04 +0900

On 3/17/23 01:34, Rick Wertenbroek wrote:
>>> By the way, enabling the interrupts to see the error notifications, I do see a
>>> lot of retry timeout and other recoverable errors. So the issues I am seeing
>>> could be due to my PCI cable setup that is not ideal (bad signal, ground loops,
>>> ... ?). Not sure. I do not have a PCI analyzer handy :)
> 
> I have enabled the IRQs and messages thanks to your patches but I don't get
> messages from the IRQs (it seems no IRQs are fired). My PCIe link seems stable.
> The main issue I face is still that, after a random amount of time, the BARs are
> reset to 0. I don't have a PCIe analyzer, so I cannot chase config space TLPs
> (e.g., the host writing the BAR values to the config header), but I don't think
> the problem comes from a TLP issued by the host (though it might).

Hmmm... I am getting lots of IRQs, especially the ones signaling "replay timer
timed out" and "replay timer rolled over after 4 transmissions of the same TLP"
but also some "phy error detected on receive side"... Need to try to rework my
cable setup I guess.

As for the BARs being reset to 0, I have not checked, but it may be why I see
things not working after some inactivity. Will check that. We may be seeing the
same problem.

> I don't think it's a buffer overflow / out-of-bounds access by kernel code, for
> two reasons:
> 1) The values in the config space around the BARs are coherent and unchanged
> 2) The BARs are reset to 0 and not to a random value
> 
> I suspect a hardware reset of those registers issued internally in the PCIe
> controller. I don't know why (it might be a link-related or power-state-related
> event).
> 
> I have also experienced very slow behavior with the PCI endpoint test driver,
> e.g., pcitest -w 1024 -d would take tens of seconds to complete. It seems to
> come from LCRC errors: when I check the "LCRC Error count register"
> @0xFD90'0214, I can see it increase drastically between two calls of pcitest
> (by 6607 (0x19CF), for example).
> 
> The "ECC Correctable Error Count Register" @0xFD90'0218 reads 0 though.
> 
> I have tried to shorten the cabling by removing one of the PCIe extenders, but
> that didn't change the issues much.
> 
> Any ideas as to why I see a large number of TLPs with LCRC errors in them?
> Do you experience the same? What are your values in 0xFD90'0214 when
> running e.g., pcitest -w 1024 -d (note: you can reset the counter by writing
> 0xFFFF to it in case it reaches the maximum value of 0xFFFF).

I have not checked. But I will look at these counters to see what I have there.
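
For comparing numbers, here is a rough userspace sketch of how the counter
could be sampled before and after a pcitest run. Only the register address
(0xFD90'0214) and the 0xFFFF maximum come from this thread. The helper names,
the use of /dev/mem, and the assumption that the count sits in the low 16 bits
of a 32-bit little-endian register are mine and untested on the board.

```python
import mmap
import os

# Address and maximum value as quoted in this thread.
LCRC_ERR_COUNT_ADDR = 0xFD900214
COUNTER_MAX = 0xFFFF


def page_split(phys_addr, page_size=mmap.PAGESIZE):
    """Split a physical address into (page-aligned base, offset in page)."""
    base = phys_addr & ~(page_size - 1)
    return base, phys_addr - base


def lcrc_delta(before, after):
    """Errors counted between two samples, or None if the counter hit its
    maximum (0xFFFF), in which case the real number is unknown."""
    if after >= COUNTER_MAX:
        return None
    return after - before


def read_counter(phys_addr=LCRC_ERR_COUNT_ADDR, devmem="/dev/mem"):
    """Read the counter through /dev/mem (assumption: needs root and a kernel
    that allows mapping this range). Python does not guarantee a single
    32-bit bus access here; a C helper with a volatile read is safer."""
    base, off = page_split(phys_addr)
    fd = os.open(devmem, os.O_RDONLY | os.O_SYNC)
    try:
        with mmap.mmap(fd, mmap.PAGESIZE, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ, offset=base) as m:
            raw = int.from_bytes(m[off:off + 4], "little")
    finally:
        os.close(fd)
    # Assumption: only the low 16 bits hold the count.
    return raw & COUNTER_MAX
```

The idea would be to call read_counter() on the endpoint board, run pcitest
-w 1024 -d on the host, call read_counter() again, and pass both values to
lcrc_delta(). A None result means the counter hit its maximum and should be
reset before the next run.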

-- 
Damien Le Moal
Western Digital Research