Re: [PATCH 22/30] PCI: tegra: Access endpoint config only if PCIe link is up

Bjorn Helgaas <helgaas@xxxxxxxxxx> · Mon, 15 Apr 2019 09:04:30 -0500

On Mon, Apr 15, 2019 at 05:06:10PM +0530, Manikanta Maddireddy wrote:
> On 12-Apr-19 8:20 PM, Bjorn Helgaas wrote:
> > On Fri, Apr 12, 2019 at 12:30:22PM +0530, Manikanta Maddireddy wrote:
> >> On 12-Apr-19 1:45 AM, Bjorn Helgaas wrote:
> >>> On Thu, Apr 11, 2019 at 10:33:47PM +0530, Manikanta Maddireddy wrote:
> >>>> Add PCIe link up check in config read and write callback functions
> >>>> before accessing endpoint config registers.
> >>>>  static int tegra_pcie_config_read(struct pci_bus *bus, unsigned int devfn,
> >>>>  				  int where, int size, u32 *value)
> >>>>  {
> >>>> +	struct tegra_pcie *pcie = bus->sysdata;
> >>>> +	struct pci_dev *bridge;
> >>>> +	struct tegra_pcie_port *port;
> >>>> +
> >>>>  	if (bus->number == 0)
> >>>>  		return pci_generic_config_read32(bus, devfn, where, size,
> >>>>  						 value);
> >>>>  
> >>>> +	bridge = pcie_find_root_port(bus->self);
> >>>> +
> >>>> +	list_for_each_entry(port, &pcie->ports, list)
> >>>> +		if (port->index + 1 == PCI_SLOT(bridge->devfn))
> >>>> +			break;
> >>>> +
> >>>> +	/* If there is no link, then there is no device */
> >>>> +	if (!tegra_pcie_link_status(port)) {
> >>> This is racy and you should avoid it if possible.  The link could go down
> >>> between calling tegra_pcie_link_status() and issuing the config read/write.
> >>>
> >>> If your driver is to be reliable, it must be able to handle any bad
> >>> consequence of issuing that config read/write anyway, so I think it's
> >>> better if it doesn't even bother checking whether the link is up.
> >> This change is made based on similar check present in dwc driver
> >> dw_pcie_valid_device(), reasons for making this change in Tegra might
> >> differ dwc.
> > Yes, you won't be surprised to learn that I don't like the similar
> > checks in dwc, altera, xilinx, and xilinx-nwl either :)  I raise this
> > issue every time I see it, but I can't remember if I've mentioned dwc
> > specifically.
> >
> > We need to either eradicate this pattern of checking for link up, or
> > include a comment about why it is absolutely necessary.
> 
> This patch is created to address below scenario in our downstream kernel,
> 1) Our platform has WiFi on one slot and GPU in another.
> 2) During WiFi OFF, link is put in L2 and it goes through hot reset
> when turning ON WiFi (since Tegra doesn't support hot-plug).
> 3) Whenever x11 server is started it scans the PCIe bus for video devices.
> Here PCIe configuration registers of all devices are read to find out
> all available video devices.
> 4) If "x11 server" started with WiFi OFF, then we are seeing "response
> decoding error"(Tegra AFI module specific error).

Probably happens with lspci too.  I guess this is when you try to read
the WiFi device config space?

> Best solution we came up with is to have link up check in config access
> callback functions.

Can you check for "response decoding error" in the config accessor and
return 0xffffffff if you see it?

> >> Intention here is to reduce the number of AER errors when device is
> >> falling off the bus or going through hot reset. So racy condition here is
> >> OK
> >
> > I'm not convinced about this.  The issues you mention need to be
> > solved in a generic way, not a tegra-specific way.
> >
> > We don't want to end up with code that silently avoids the config
> > access 99.99% of the time, but once in a blue moon, we lose the race
> > (the device stops responding after we've determined the link is up)
> > and the access causes a mysterious AER error that we have no way to
> > debug.
> >
> >>>> +		*value = 0xffffffff;
> >>>> +		return PCIBIOS_DEVICE_NOT_FOUND;
> >>>> +	}
> >>>> +
> >>>>  	return pci_generic_config_read(bus, devfn, where, size, value);
> >>>>  }