Re: [PATCH v2 0/9] PCI: rockchip: Fix RK3399 PCIe endpoint controller driver

Rick Wertenbroek <rick.wertenbroek@xxxxxxxxx> · Thu, 16 Mar 2023 13:52:49 +0100

On Wed, Mar 15, 2023 at 1:00 AM Damien Le Moal
<damien.lemoal@xxxxxxxxxxxxxxxxxx> wrote:
>
> On 3/15/23 07:54, Damien Le Moal wrote:
> > On 3/14/23 23:53, Rick Wertenbroek wrote:
> >> Hello Damien,
> >> I also noticed random issues I suspect to be related to link status or power
> >> state, in my case it sometimes happens that the BARs (0-5) in the config
> >> space get reset to 0. This is not due to the driver because the driver never
> >> ever accesses these registers (@0xfd80'0010 to 0xfd80'0024 TRM
> >> 17.6.4.1.5-17.6.4.1.10).
> >> I don't think the host rewrites them because lspci shows the BARs as
> >> "[virtual]" which means they have been assigned by host but have 0
> >> value in the endpoint device (when lspci rereads the PCI config header).
> >> See https://github.com/pciutils/pciutils/blob/master/lspci.c#L422
> >>
> >> So I suspect the controller detects something related to link status or
> >> power state and internally (in hardware) resets those registers. It's not
> >> the kernel code, it never accesses these regs. The problem occurs
> >> very randomly, sometimes in a few seconds, sometimes I cannot see
> >> it for a whole day.
> >>
> >> Is this similar to what you are experiencing ?
> >
> > Yes. I sometimes get NMIs after starting the function driver, when my function
> > driver starts probing the bar registers after seeing the host changing one
> > register. And the link also comes up with 4 lanes or 2 lanes, random.

Hello, I have never had it come up with only 2 lanes, I get 4 consistently.
I have it connected through an M.2 to female PCIe 16x (4x electrically
connected), then through a male-to-male PCIe 4x cable with TX/RX swap, then
through a 16x extender. All three cables are approx. 25 cm. It seems stable.
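
As an illustration, the negotiated lane count can be checked from the host
side through the standard PCI sysfs attributes (current_link_width,
max_link_width, current_link_speed, max_link_speed). The sketch below is only
a minimal example: the device address used in the usage note is hypothetical,
and the helper names are made up for this illustration.

```python
from pathlib import Path

LINK_ATTRS = (
    "current_link_width",
    "max_link_width",
    "current_link_speed",
    "max_link_speed",
)


def read_link_status(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return the link attributes of one PCI device as a dict of strings.

    bdf is the device address as shown by lspci -D (domain:bus:dev.fn).
    Attributes that are missing in sysfs are reported as None.
    """
    dev = Path(sysfs_root) / bdf
    status = {}
    for attr in LINK_ATTRS:
        path = dev / attr
        status[attr] = path.read_text().strip() if path.exists() else None
    return status


def link_is_degraded(status):
    """True when the link trained with fewer lanes than the device supports."""
    current = status["current_link_width"]
    maximum = status["max_link_width"]
    if current is None or maximum is None:
        return False
    return int(current) < int(maximum)
```

For example, link_is_degraded(read_link_status("0000:01:00.0")) run on the
host after each boot would flag the boots where the endpoint trained with
fewer lanes than it advertises (the address 0000:01:00.0 is a placeholder).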

> >
> >> Do you have any idea as to what could cause these registers to be reset
> >> (I could not find anything in the TRM, also nothing in the driver seems to
> >> cause it).
> >
> > My thinking is that since we do not have a linkup notifier, the function driver
> > starts setting things up without the link established (e.g. when the host is
> > still powered down). Once the host starts booting and the PCI link is
> > established, things may be reset in the hardware... That is the only thing I
> > can think of.

This might be worth investigating, I'll look into it. However, many of the EP
drivers don't have a link-up notifier: drivers/pci/controller/dwc/pci-dra7xx.c
has one, but most of the others don't, so it might not be absolutely required.

> >
> > And yes, there is definitely something going on with the power states too I
> > think: if I let things idle for a few minutes, everything stops working: no
> > activity seen on the endpoint over the BARs. I tried enabling the sys and client
> > interrupts to see if I can see power state changes, or if clearing the
> > interrupts helps (they are masked by default), but no change. And booting the
> > host with pci_aspm=off does not help either. Also tried to change all the
> > capabilities related to link & power states to "off" (not supported), and no
> > change either. So currently, I am out of ideas regarding that one.
> >
> > I am trying to make progress on my endpoint driver (nvme function) to be sure it
> > is not a bug there that breaks things. I may still have something bad because
> > when I enable the BIOS native NVMe driver on the host, either the host does not
> > boot, or grub crashes with memory corruptions. Overall, not yet very stable and
> > still trying to sort out the root cause of that.

I am also working on an NVMe driver but I have our NVMe firmware running in
userspace so our endpoint function driver only exposes the BARs as UIO
mapped memory and has a simple interface to generate IRQs to host / initiate
DMA transfers.

So that driver does very little in itself and I still have problems with the
BARs getting unmapped (reset to 0) randomly. I hope your patches for monitoring
the IRQs will shed some light on this. I also observed the BARs getting reset
with the PCIe EP test function driver, so I don't think the function is
necessarily to blame, but rather the controller itself (also because none of
the kernel code should access, or does access, the BAR registers @0xfd80'0010).
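
One way to catch the moment the BARs get cleared is to snapshot the six BAR
registers from userspace on the endpoint and compare successive snapshots.
The sketch below is a minimal example under several assumptions: the base
address 0xfd800000 is inferred from the register addresses quoted in this
thread, /dev/mem access to that region is assumed to be permitted on the
board (kernel options such as CONFIG_IO_STRICT_DEVMEM may forbid it), and
slicing an mmap object does not guarantee 32-bit wide register accesses. The
function names are made up for this illustration.

```python
import mmap
import os
import struct

EP_CFG_BASE = 0xFD800000  # assumption: inferred from 0xfd80'0010 in this thread
BAR0_OFFSET = 0x10        # first BAR register in the config header
NUM_BARS = 6              # BAR registers at offsets 0x10 through 0x24


def decode_bars(raw):
    """Unpack six little-endian 32-bit BAR values from 24 raw bytes."""
    return list(struct.unpack("<6I", raw))


def cleared_bars(before, after):
    """Return the indices of BARs that were non-zero before and read 0 now."""
    return [
        i for i, (old, new) in enumerate(zip(before, after))
        if old != 0 and new == 0
    ]


def read_bars_devmem(base=EP_CFG_BASE):
    """Read the BAR registers through /dev/mem (requires root on the EP)."""
    fd = os.open("/dev/mem", os.O_RDONLY | os.O_SYNC)
    try:
        with mmap.mmap(fd, mmap.PAGESIZE, mmap.MAP_SHARED,
                       mmap.PROT_READ, offset=base) as mem:
            return decode_bars(mem[BAR0_OFFSET:BAR0_OFFSET + 4 * NUM_BARS])
    finally:
        os.close(fd)
```

Calling read_bars_devmem() in a loop with a short sleep and logging a
timestamp whenever cleared_bars() returns a non-empty list would give a time
reference to correlate with the interrupt debug prints.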

>
> By the way, enabling the interrupts to see the error notifications, I do see a
> lot of retry timeout and other recoverable errors. So the issues I am seeing
> could be due to my PCI cable setup that is not ideal (bad signal, ground loops,
> ... ?). Not sure. I do not have a PCI analyzer handy :)
>
> I attached the patches I used to enable the EP interrupts. Enabling debug prints
> will tell you what is going on. That may give you some hints on your setup ?
>
> --
> Damien Le Moal
> Western Digital Research

Thank you for these patches. I will try them and see if they give me more info.

Also, I will delay the release of the v3 of my patch series because of these
issues. The v3 only incorporates the changes discussed here on the mailing
list, so your version should be up to date. If you want me to send you the
series in its current state, let me know.

But I will need to do some more debugging; I'll release the v3 when the driver
is more stable. I don't know when, as I don't have that much time for this
project. Thanks for your understanding.

Rick