Re: Why do we check for "link-up" in *_pcie_valid_device()?

Lorenzo Pieralisi <lorenzo.pieralisi@xxxxxxx> · Tue, 2 Jan 2018 11:37:50 +0000

On Fri, Dec 22, 2017 at 11:28:15AM -0600, Bjorn Helgaas wrote:
> On Fri, Dec 22, 2017 at 01:02:28PM +0000, Bharat Kumar Gogada wrote:
> > Bjorn wrote:
> >> In the PCI config access path, the *_pcie_valid_device() functions
> >> in the dwc, altera, rockchip, and xilinx drivers all check whether
> >> the link is up.
> >> 
> >> I think this is racy because the link may go down after we check but
> >> before we perform the config access.
> >> 
> >> What would blow up if we removed the *_pcie_link_up() checks?
> >> 
> >> I'd like to either remove the checks or add comments about why the
> >> race is acceptable.  If we've covered this before, I apologize.
> >> Adding a comment will keep me from pestering you about this again in
> >> the future.
> 
> > In both Xilinx driver cases when link is down, hardware responds by
> > AXI DECERR/SLVERR status which causes an exception, synchronous
> > external abort to CPU.  This causes system to hang, so we need this
> > check for both of our drivers.  We will add comments. 
> 
> This is a problem, and checking whether the link is up is a workaround
> but not a real solution.  That means your system may hang if the link
> happens to go down at the wrong time.
> 
> A real solution would be to handle the synchronous external abort
> so it doesn't cause a system hang.

I still have no idea why checking (in a broken way) that the link
is up for _every_ given config access is a solution.

If, say, the link is down in xilinx_pcie_init_port(), what would bring
it up or, worded differently, why checking that the link is up for every
config access instead of host bridge probe time make this code any safer
?

It is not safe anyway, it would be good to understand though why the code
was written this way so that we can change it appropriately.

Thanks,
Lorenzo