On 11/08/2018 02:09 PM, Bjorn Helgaas wrote:
> [+cc Jonathan, Greg, Lukas, Russell, Sam, Oliver for discussion about
> PCI error recovery in general]

Has anyone seen the ECRs in the PCIe base spec and ACPI that have been
floating around the past few months (HPX, SFI, CER)? Without divulging
too much before publication, I'm curious about opinions on how well (or
not well) those flows would work in general, and in Linux.

> On Wed, Nov 07, 2018 at 05:42:57PM -0600, Bjorn Helgaas wrote:
> I'm having second thoughts about this. One thing I'm uncomfortable
> with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> instead of systematic, in the sense that I don't know how we convince
> ourselves that this (and only this) is the correct place to put it.
>
> Another is that the only place we call pci_dev_set_disconnected() is
> in pciehp and acpiphp, so the only "disconnected" case we catch is if
> hotplug happens to be involved. Every MMIO read from the device is an
> opportunity to learn whether it is reachable (a read from an
> unreachable device typically returns ~0 data), but we don't do
> anything at all with those.
>
> The config accessors already check pci_dev_is_disconnected(), so this
> patch is really aimed at MMIO accesses. I think it would be more
> robust if we added wrappers for readl() and writel() so we could
> notice read errors and avoid future reads and writes.

I wouldn't expect anything less than complete scrutiny and quality
control of unquestionable moral integrity :).

In theory ~0 can be a great indicator that something may be wrong,
though I think it's about as ad hoc as pci_dev_is_disconnected(). I
slightly like the idea of wrapping the MMIO accessors.

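To make the wrapper idea concrete, here is a rough userspace sketch, not
kernel code. struct fake_dev, dev_readl() and dev_writel() are made-up
names for illustration, and the "MMIO window" is just an array. Treating
any ~0 read as "device gone" is an assumption of this sketch: a register
that can legitimately read all ones would trip it, so a real version
would probably need a second check before latching the flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a device; this is not struct pci_dev. */
struct fake_dev {
	volatile uint32_t *regs;	/* simulated MMIO window */
	bool disconnected;		/* sticky once set */
};

/*
 * Read a 32-bit register. An all-ones result is taken as a hint that
 * the device is unreachable (assumption: no register in this window
 * legitimately holds ~0), and the flag is latched.
 */
static uint32_t dev_readl(struct fake_dev *dev, unsigned int idx)
{
	uint32_t val;

	assert(dev);
	if (dev->disconnected)
		return UINT32_MAX;	/* skip the access entirely */

	val = dev->regs[idx];
	if (val == UINT32_MAX)
		dev->disconnected = true;
	return val;
}

/* Write a 32-bit register unless the device was already marked gone. */
static void dev_writel(struct fake_dev *dev, uint32_t val, unsigned int idx)
{
	assert(dev);
	if (dev->disconnected)
		return;
	dev->regs[idx] = val;
}
```

Once a read comes back as ~0, every later dev_readl() returns ~0 without
touching the window and every later dev_writel() is dropped. This only
covers explicit register accesses; it does nothing for memcpy-style or
DMA traffic.
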
There's still memcpy and DMA that cause the same MemRead/Wr PCIe
transactions, and the same sort of errors in PCIe land, and it would be
good to have more testing on this. Since this patch is tested and
confirmed to fix a known failure case, I would keep it, and then look at
fixing the problem in a more generic way.

BTW, a lot of the problems we're fixing here come courtesy of
firmware-first error handling. Do we reach a point where we draw a line
in handling new problems introduced by FFS? So, if something is a
problem with FFS, but not native handling, do we commit to supporting
it?

Alex