Re: [PATCH v6 3/3] PCI: xilinx-xdma: Add Xilinx XDMA Root Port driver

Bjorn Helgaas <helgaas@xxxxxxxxxx> · Mon, 28 Aug 2023 16:11:21 -0500

On Mon, Aug 28, 2023 at 12:01:52PM +0000, Havalige, Thippeswamy wrote:
> > > On Fri, Aug 18, 2023 at 03:05:07PM +0530, Thippeswamy Havalige wrote:
> > > > ...

> > > > +static bool xilinx_pl_dma_pcie_valid_device(struct pci_bus *bus,
> > > > +					    unsigned int devfn)
> > > > +{
> > > > +	struct pl_dma_pcie *port = bus->sysdata;
> > > > +
> > > > +	/* Check if link is up when trying to access downstream ports */
> > > > +	if (!pci_is_root_bus(bus)) {
> > > > +		/*
> > > > +		 * Checking whether link is up here is a last line of defence,
> > > > +		 * if the link goes down after we check for link-up, we have a
> > > > +		 * problem: if a PIO request is initiated while link-down, the
> > > > +		 * whole controller hangs, and even after link comes up again,
> > > > +		 * previous PIO requests won't work, and a reset of the whole
> > > > +		 * PCIe controller is needed. Henceforth we need link-up
> > > check
> > > > +		 * here to avoid sending PIO request when link is down. This
> > > > +		 * check is racy by definition and does not make controller
> > > hang
> > > > +		 * if the link goes down after this check is performed.
> > >
> > > This comment doesn't make sense to me.  "If PIO request initiated
> > > while link- down, controller hangs ... This check is racy and does not
> > > make controller hang if link goes down."  Which is it?
> - Here checking link up treats device as invalid.
> 
> Please find comment that I ll update in next patch and 
> Please letme know if any changes are needed.
> 
>   /*
>                  * Checking whether the link is up. Here is the last line of
>                  * defence. If the link goes down after we check for link-up,
>                  * we have a problem. If a PIO request is initiated while link
>                  * is down, the whole controller hangs. Even after link comes up
>                  * again, previous PIO requests won't work, and a reset of the
>                  * whole PCIe controller is needed. Henceforth we need link-up
>                  * check here to treat device as invalid and avoid sending PIO
>                  * request when link is down and this check is inherently racy
>                  * by definition.
> */
> > >
> > > My *guess* is that this check narrows the window but doesn't close it,
> > > so if
> > > xilinx_pl_dma_pcie_link_up() finds the link up, but the link goes down
> > > before
> > > pci_generic_config_read() initiates the PIO request, the controller
> > > hangs, and a reset is required.

Sorry, I dragged this out by not giving you a more useful suggestion
to begin with.  Since advk_pcie_valid_device() has the same issue,
copying its comment was a great place to start.

But I think advk_pcie_valid_device(), altera_pcie_valid_device(),
nwl_pcie_valid_device(), xilinx_pcie_valid_device(), and now
xilinx_pl_dma_pcie_valid_device() all have the same race, but none of
them really address it in the comment.

I'm looking for explicit acknowledgement that we can't reliably
prevent this unrecoverable error, e.g., something like this:

  Sending a PIO request to a downstream device when the link is down
  causes an unrecoverable error, and a reset of the entire PCIe
  controller will be needed.  We can reduce the likelihood of that
  unrecoverable error by checking whether the link is up, but can't
  completely prevent it because the link may go down between the
  link-up check and the PIO request.

Bjorn