Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

<Alex_Gagniuc@xxxxxxxxxxxx> · Thu, 8 Nov 2018 22:20:46 +0000

On 11/08/2018 02:09 PM, Bjorn Helgaas wrote:
> [+cc Jonathan, Greg, Lukas, Russell, Sam, Oliver for discussion about
> PCI error recovery in general]

Has anyone seen the ECRs in the PCIe base spec and ACPI that have
been floating around the past few months? -- HPX, SFI, CER. Without
divulging too much before publication, I'm curious to hear opinions on
how well (or not) those flows would work in general, and in Linux.

> On Wed, Nov 07, 2018 at 05:42:57PM -0600, Bjorn Helgaas wrote:
> I'm having second thoughts about this.  One thing I'm uncomfortable
> with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> instead of systematic, in the sense that I don't know how we convince
> ourselves that this (and only this) is the correct place to put it.
>
> Another is that the only place we call pci_dev_set_disconnected() is
> in pciehp and acpiphp, so the only "disconnected" case we catch is if
> hotplug happens to be involved.  Every MMIO read from the device is an
> opportunity to learn whether it is reachable (a read from an
> unreachable device typically returns ~0 data), but we don't do
> anything at all with those.
>
> The config accessors already check pci_dev_is_disconnected(), so this
> patch is really aimed at MMIO accesses.  I think it would be more
> robust if we added wrappers for readl() and writel() so we could
> notice read errors and avoid future reads and writes.

I wouldn't expect anything less than complete scrutiny and quality
control of unquestionable moral integrity :). In theory, ~0 can be a
great indicator that something may be wrong, though I think it's about
as ad hoc as pci_dev_is_disconnected().

I somewhat like the idea of wrapping the MMIO accessors. There's still
memcpy and DMA that cause the same MemRead/Wr PCIe transactions, and the
same sort of errors in PCIe land, and it would be good to have more
testing on this. Since this patch is tested and confirmed to fix a known
failure case, I would keep it, and then look at fixing the problem in a
more generic way.
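
For the sake of discussion, wrapped accessors could look roughly like
the sketch below. It is a userspace mock, not kernel code: struct
mock_dev, mock_readl_checked() and mock_writel_checked() are made-up
names, the regs pointer stands in for an ioremap()ed BAR, and the
all-ones check is only a heuristic that a real implementation would
have to confirm some other way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct pci_dev; only the state we need. */
struct mock_dev {
	bool disconnected;        /* what pci_dev_is_disconnected() would report */
	volatile uint32_t *regs;  /* stand-in for an ioremap()ed BAR */
};

/*
 * Hypothetical readl() wrapper. Once the device is known to be gone,
 * return all ones without touching the hardware again.
 */
static uint32_t mock_readl_checked(struct mock_dev *dev, unsigned int idx)
{
	uint32_t val;

	assert(dev && dev->regs);
	if (dev->disconnected)
		return UINT32_MAX;

	val = dev->regs[idx];
	/*
	 * All ones is only a hint: a healthy register may legitimately
	 * read ~0. This sketch treats it as "gone" for simplicity.
	 */
	if (val == UINT32_MAX)
		dev->disconnected = true;
	return val;
}

/*
 * Hypothetical writel() wrapper. Returns false and skips the write
 * when the device is already marked disconnected.
 */
static bool mock_writel_checked(struct mock_dev *dev, unsigned int idx,
				uint32_t val)
{
	assert(dev && dev->regs);
	if (dev->disconnected)
		return false;

	dev->regs[idx] = val;
	return true;
}
```

Note that this only covers register-style accesses; it does nothing
for the memcpy and DMA paths mentioned above.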

BTW, a lot of the problems we're fixing here come courtesy of
firmware-first (FFS) error handling. Do we reach a point where we draw
a line in handling new problems introduced by FFS? So, if something is
a problem with FFS, but not with native handling, do we commit to
supporting it?

Alex