Re: [PATCH v2 0/9] PCI: rockchip: Fix RK3399 PCIe endpoint controller driver

Rick Wertenbroek <rick.wertenbroek@xxxxxxxxx> · Thu, 16 Mar 2023 17:34:48 +0100

On Thu, Mar 16, 2023 at 1:52 PM Rick Wertenbroek
<rick.wertenbroek@xxxxxxxxx> wrote:
>
> On Wed, Mar 15, 2023 at 1:00 AM Damien Le Moal
> <damien.lemoal@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > On 3/15/23 07:54, Damien Le Moal wrote:
> > > On 3/14/23 23:53, Rick Wertenbroek wrote:
> > >> Hello Damien,
> > >> I also noticed random issues I suspect to be related to link status or power
> > >> state, in my case it sometimes happens that the BARs (0-6) in the config
> > >> space get reset to 0. This is not due to the driver because the driver never
> > >> ever accesses these registers (@0xfd80'0010 to 0xfd80'0024 TRM
> > >> 17.6.4.1.5-17.6.4.1.10).
> > >> I don't think the host rewrites them because lspci shows the BARs as
> > >> "[virtual]" which means they have been assigned by host but have 0
> > >> value in the endpoint device (when lspci rereads the PCI config header).
> > >> See https://github.com/pciutils/pciutils/blob/master/lspci.c#L422
> > >>
> > >> So I suspect the controller detects something related to link status or
> > >> power state and internally (in hardware) resets those registers. It's not
> > >> the kernel code, it never accesses these regs. The problem occurs
> > >> very randomly, sometimes in a few seconds, sometimes I cannot see
> > >> it for a whole day.
> > >>
> > >> Is this similar to what you are experiencing ?
> > >
> > > Yes. I sometimes get NMIs after starting the function driver, when my function
> > > driver starts probing the bar registers after seeing the host changing one
> > > register. And the link also comes up with 4 lanes or 2 lanes, random.
>
> Hello, I have never had it come up with only 2 lanes, I get 4 consistently.
> I have it connected through a M.2 to female PCIe 16x (4x electrically
> connected),
> then through a male-to-male PCIe 4x cable with TX/RX swap, then through a
> 16x extender. All three cables are approx 25cm. It seems stable.
>
> > >
> > >> Do you have any idea as to what could make these registers to be reset
> > >> (I could not find anything in the TRM, also nothing in the driver seems to
> > >> cause it).
> > >
> > > My thinking is that since we do not have a linkup notifier, the function driver
> > > starts setting things up without the link established (e.g. when the host is
> > > still powered down). Once the host start booting and pic link is established,
> > > things may be reset in the hardware... That is the only thing I can think of.
>
> This might be worth investigating, I'll look into it, but it seems
> many of the EP
> drivers don't have a Linkup notifier,
> drivers/pci/controller/dwc/pci-dra7xx.c has
> one, but most of the other EP drivers don't have them, so it might not be
> absolutely required.
>
> > >
> > > And yes, there are definitely something going on with the power states too I
> > > think: if I let things idle for a few minutes, everything stops working: no
> > > activity seen on the endpoint over the BARs. I tried enabling the sys and client
> > > interrupts to see if I can see power state changes, or if clearing the
> > > interrupts helps (they are masked by default), but no change. And booting the
> > > host with pci_aspm=off does not help either. Also tried to change all the
> > > capabilities related to link & power states to "off" (not supported), and no
> > > change either. So currently, I am out of ideas regarding that one.
> > >
> > > I am trying to make progress on my endpoint driver (nvme function) to be sure it
> > > is not a bug there that breaks things. I may still have something bad because
> > > when I enable the BIOS native NVMe driver on the host, either the host does not
> > > boot, or grub crashes with memory corruptions. Overall, not yet very stable and
> > > still trying to sort out the root cause of that.
>
> I am also working on an NVMe driver but I have our NVMe firmware running in
> userspace so our endpoint function driver only exposes the BARs as UIO
> mapped memory and has a simple interface to generate IRQs to host / initiate
> DMA transfers.
>
> So that driver does very little in itself and I still have problems
> with the BARs
> getting unmapped (reset to 0) randomly. I hope your patches for monitoring
> the IRQs will shed some light on this. I also observed the BARs getting reset
> with the pcie ep test function driver, so I don't think it necessarily
> is the function
> that is to blame, rather the controller itself (also because none of
> the kernel code
> should / does access the BARs registers @0xfd80'0010).
>
> >
> > By the way, enabling the interrupts to see the error notifications, I do see a
> > lot of retry timeout and other recoverable errors. So the issues I am seeing
> > could be due to my PCI cable setup that is not ideal (bad signal, ground loops,
> > ... ?). Not sure. I do not have a PCI analyzer handy :)

I have enabled the IRQs and messages thanks to your patches but I don't get
messages from the IRQs (it seems no IRQs are fired). My PCIe link seems stable.
The main issue I face is still that after a random amount of time, the BARs are
reset to 0, I don't have a PCIe analyzer so I cannot chase config space TLPs
(e.g., host writing the BAR values to the config header), but I don't think that
the problem comes from a TLP issued from the host. (it might be).

I don't think it's a buffer overflow / out-of-bounds access by kernel
code for two reasons
1) The values in the config space around the BARs is coherent and unchanged
2) The bars are reset to 0 and not a random value

I suspect a hardware reset of those registers issued internally in the
PCIe controller,
I don't know why (it might be a link related event or power state
related event).

I have also experienced very slow behavior with the PCI endpoint test driver,
e.g., pcitest -w 1024 -d would take tens of seconds to complete. It seems to
come from LCRC errors, when I check the "LCRC Error count register"
@0xFD90'0214 I can see it drastically increase between two calls of pcitest
(when I mean drastically it means by 6607 (0x19CF) for example).

The "ECC Correctable Error Count Register" @0xFD90'0218 reads 0 though.

I have tried to shorten the cabling by removing one of the PCIe extenders, that
didn't change the issues much.

Any ideas as to why I see a large number of TLPs with LCRC errors in them ?
Do you experience the same ? What are your values in 0xFD90'0214 when
running e.g., pcitest -w 1024 -d (note: you can reset the counter by writing
0xFFFF to it in case it reaches the maximum value of 0xFFFF).

> >
> > I attached the patches I used to enable the EP interrupts. Enabling debug prints
> > will tell you what is going on. That may give you some hints on your setup ?
> >
> > --
> > Damien Le Moal
> > Western Digital Research
>
> Thank you for these patches. I will try them and see if they give me more info.
>
> Also, I will delay the release of the v3 of my patch series because of
> these issues.
> The v3 only incorporates the changes discussed here in the mailing list so your
> version should be up to date. If you want me to send you the series in
> its current
> state let me know.
>
> But I will need some more debugging, I'll release the v3 when the driver is more
> stable. I don't when, I don't have that much time on this project. Thanks for
> your understanding.
>
> Rick