Hi Bjorn, On 8/18/2017 5:01 PM, Bjorn Helgaas wrote: > On Fri, Aug 11, 2017 at 12:56:35PM -0400, Sinan Kaya wrote: >> Sporadic reset issues have been observed with Intel 750 NVMe drive while >> assigning the physical function to the guest machine. The sequence of >> events observed is as follows: >> >> - perform a Function Level Reset (FLR) >> - sleep up to 1000ms total >> - read ~0 from PCI_COMMAND >> - warn that the device didn't return from FLR >> - touch the device before it's ready >> - drop register read and writes performing register settings restore >> - incomplete reset operation and partial register restoration >> - second time device probe fails in the guest machine as HW is left in >> limbo. > > Pardon me while I think out loud for a while. It's taking me a while > to untangle my confusion about how CRS works. > no problem, > Prior to this patch, you saw the "Failed to return from FLR" warning. > That means we did this: > > 0ms assert FLR > 100ms read PCI_COMMAND, get ~0 (i == 0) > 200ms read PCI_COMMAND, get ~0 (i == 1) > ... > 1000ms read PCI_COMMAND, get ~0 (i == 9) > > If we did not get completions for those config reads, the root port > would complete the read as a failed transaction and probably > synthesize ~0 data. That's correct. This root port returns ~0 when destination is unreachable. > But if that were the case, this patch would make > no difference: it reads PCI_VENDOR_ID instead of PCI_COMMAND the first > time around, that would also be a failed transaction, and we wouldn't > interpret it as a CRS status, so the sequence would be exactly the > above except the first read would be of PCI_VENDOR_ID, not > PCI_COMMAND. > > Since this patch *does* make a difference, CRS Software Visibility > must be supported and enabled, and we must have gotten completions > with CRS status for the PCI_COMMAND reads. That's right, CRS visibility is supported and enabled. Root port returns 0xFFFF0001 when vendor ID is read as per the spec. > Per the completion > handling rules in sec 2.3.2, the root complex should transparently > retry the read (since we're not reading PCI_VENDOR_ID, CRS Software > Visibility doesn't apply, so this is invisible to software). But the > root complex *is* allowed to limit the number of retries and if the > limit is reached, complete the request as a failed transaction. This version of the root port doesn't support retry mechanism. This is a TO-DO for the hardware team. This root port returns ~0 for all accesses other than vendor id. This means we waited 1 seconds and the device was not accessible. Command register was not reachable at the end of 1 seconds. > > That must be what happened before this patch. If that's the case, we > should see an error logged from the failed transaction. We should be > able to verify this if we collected "lspci -vv" output for the system > before and after the FLR. > See above. > If CRS SV is not present (or not enabled), the PCI_VENDOR_ID read will > still get a CRS completion, but the root port will retry it instead of > returning the 0x0001 ID. When we hit the retry limit, the root port > will synthesize ~0 data to complete the read. This is what is missing from the HW. I'm trying to get this corrected to be complete. > > That means we won't call pci_bus_wait_crs() at all, and we'll fall > back to the original PCI_COMMAND reads. That path will only wait > 1000ms, just like the original code before this patch. So I *think* > that if you disable CRS SV on this system (or if you put the device in > a system that doesn't support CRS SV), you'll probably see the same > "failed to return from FLR" warning. > > You could easily test this by using setpci to turn off > PCI_EXP_RTCTL_CRSSVE in the root port leading to the NVMe device > before the FLR. > > I think the implication is that we need to generalize > pci_bus_wait_crs() to be more of a "wait for device ready" interface > that can use CRS SV (if supported and enabled) or something like the > PCI_COMMAND loop (if no CRS SV). It should use the same timeout > either way, since CRS SV is a property of the root port, not of the > device we're waiting for. I saw your comment about timeout. I was trying not to change the behavior for systems that do not support CRS visibility. we can certainly increase the timeout if you think it is better. I saw your V11, I'll review them in a minute. > > It's "interesting" that the PCI core error handling code doesn't look > at the Master Abort bit at all, even though it's pretty fundamental to > conventional PCI error reporting. I suspect Master Abort gets set > whenever a root port synthesizes ~0 data, but (with the exception of > some arch code) we never look at it and never clear it. > >> An endpoint is allowed to issue Configuration Request Retry Status (CRS) >> following a FLR request to indicate that it is not ready to accept new >> requests. CRS is defined in PCIe r3.1, sec 2.3.1. Request Handling Rules >> and CRS usage in FLR context is mentioned in PCIe r3.1, sec 6.6.2. >> Function-Level Reset. >> >> A CRS indication will only be given if the address to be read is vendor ID >> register. pci_bus_read_dev_vendor_id() knows how to deal with CRS returned >> 0xFFFF0001 value and will continue polling until a value other than >> 0xFFFF0001 is returned within a given timeout. >> >> Try to discover device presence via CRS first. If device is not found, fall >> through to old behavior. >> >> Signed-off-by: Sinan Kaya <okaya@xxxxxxxxxxxxxx> >> --- >> drivers/pci/pci.c | 23 +++++++++++++++++++++-- >> 1 file changed, 21 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >> index af0cc34..c853551 100644 >> --- a/drivers/pci/pci.c >> +++ b/drivers/pci/pci.c >> @@ -3821,17 +3821,36 @@ static void pci_flr_wait(struct pci_dev *dev) >> { >> int i = 0; >> u32 id; >> + bool ret; >> + >> + /* >> + * Don't touch the HW before waiting 100ms. HW has to finish >> + * within 100ms according to PCI Express Base Specification >> + * Revision 3.1 Section 6.6.2: Function-Level Reset (FLR). >> + */ >> + msleep(100); >> + >> + if (pci_bus_read_config_dword(dev->bus, dev->devfn, PCI_VENDOR_ID, >> + &id)) >> + return; >> + >> + /* See if we can find a device via CRS first. */ >> + if ((id & 0xffff) == 0x0001) { >> + ret = pci_bus_wait_crs(dev->bus, dev->devfn, &id, 60000); >> + if (ret) >> + return; >> + } >> >> do { >> msleep(100); >> pci_read_config_dword(dev, PCI_COMMAND, &id); >> - } while (i++ < 10 && id == ~0); >> + } while (i++ < 9 && id == ~0); >> >> if (id == ~0) >> dev_warn(&dev->dev, "Failed to return from FLR\n"); >> else if (i > 1) >> dev_info(&dev->dev, "Required additional %dms to return from FLR\n", >> - (i - 1) * 100); >> + i * 100); >> } >> >> /** >> -- >> 1.9.1 >> > -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.