On 8/10/2017 5:52 PM, Bjorn Helgaas wrote: > On Tue, Aug 08, 2017 at 08:57:24PM -0400, Sinan Kaya wrote: >> Sporadic reset issues have been observed with Intel 750 NVMe drive by >> writing to the reset file in sysfs in a loop. The sequence of events >> observed is as follows: >> >> - perform a Function Level Reset (FLR) >> - sleep up to 1000ms total >> - read ~0 from PCI_COMMAND >> - warn that the device didn't return from FLR >> - touch the device before it's ready > > What's the driver-visible or user-visible effect of touching the > device before it's ready? This sequence is sort of like the joke > without the punch line :) The issue was first reported as a sporadic failure to assign NVMe physical function to the guest machine and observing NVMe drive initialization failures in the guest machine. I then repeated the problem by creating a reset loop. "NVMe (non-volatile memory) is not passing through to Virtual machine. NVMe is listed using lspci, however, the device is not listed using lsblk and fdisk -l. dmesg from virtual machine shows root@null-8cfdf006971f:~# dmesg | grep nvme [ 1.648842] nvme nvme0: pci function 0000:02:01.0 [ 1.650176] nvme 0000:02:01.0: enabling device (0000 -> 0002) [ 2.547091] [<ffff000000a4c8e0>] nvme_irq [nvme] [ 70.795558] nvme nvme0: I/O 203 QID 0 timeout, disable controller [ 70.904369] nvme nvme0: Removing after probe failure status: -4" I can include this into the commit message. > >> An endpoint is allowed to issue Configuration Request Retry Status (CRS) >> following a FLR request to indicate that it is not ready to accept new >> requests. CRS is defined in PCIe r3.1, sec 2.3.1. Request Handling Rules >> and CRS usage in FLR context is mentioned in PCIe r3.1a, sec 6.6.2. >> Function-Level Reset. > > You reference both PCIe r3.1 and r3.1a. Is there something new in > r3.1a in this area? If not, just reference r3.1 for both cases. I can use r3.1. I have r3.1a. I'm not sure if the chapter numbers match or not. > >> A CRS indication will only be given if the address to be read is vendor ID >> register. pci_bus_read_dev_vendor_id() knows how to deal with CRS returned >> 0xFFFF0001 value and will continue polling until a value other than >> 0xFFFF0001 is returned within a given timeout. >> >> Try to discover device presence via CRS first. If device is not found, fall >> through to old behavior. >> >> Signed-off-by: Sinan Kaya <okaya@xxxxxxxxxxxxxx> >> --- >> drivers/pci/pci.c | 26 ++++++++++++++++++++++++-- >> 1 file changed, 24 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >> index af0cc34..4366299 100644 >> --- a/drivers/pci/pci.c >> +++ b/drivers/pci/pci.c >> @@ -3821,17 +3821,39 @@ static void pci_flr_wait(struct pci_dev *dev) >> { >> int i = 0; >> u32 id; >> + bool ret; >> + >> + /* >> + * Don't touch the HW before waiting 100ms. HW has to finish >> + * within 100ms according to PCI Express Base Specification >> + * Revision 3.1 Section 6.6.2: Function-Level Reset (FLR). >> + */ >> + msleep(100); >> + >> + /* >> + * See if we can find a device via CRS first. Physical functions >> + * return from here if found, Virtual functions fall through as >> + * they return ~0 on vendor id read once CRS is completed. >> + */ >> + ret = pci_bus_read_dev_vendor_id(dev->bus, dev->devfn, &id, >> + 60000); >> + if (ret) >> + return; > > Alex was concerned that pci_bus_read_dev_vendor_id() could return > false ("no device here") with an ID value of ~0 for a functional VF > [1]. > > I think that is indeed possible: > > - a VF that is ready will return 0xffff as both Vendor ID and Device > ID (SR-IOV r1.1, sec 3.4.1.1, .2), so the very first read in > pci_bus_read_dev_vendor_id() would see 0xffffffff and return > "false". > > - a VF that needs more time will return CRS and we'll loop in > pci_bus_read_dev_vendor_id() until it becomes ready, and we'll > return "true". > > Falling into the code below for the "false" case probably will work, > but it's a little bit ugly because (a) we're using two mechanisms to > figure out when the device is ready for config requests, and (b) we > have to worry about VFs both in pci_bus_read_dev_vendor_id() and here > in the caller. OK, I'm open to improving the code. > > Here's an idea to make pci_bus_read_dev_vendor_id() work for both VFs > and PFs. It can't distinguish the 0xffffffff from a VF vs one from a > non-existent device, but the caller might be able to pass that > information in, e.g., when we're enumerating and don't know whether > the device exists, we don't have a pci_dev and would use this: How about creating a pci_bus_wait_crs() function with the loop in pci_bus_read_dev_vendor_id() and calling it from both places? > > pci_bus_read_dev_vendor_id(bus, devfn, &l, 60000, 0) > > While places where we do have a pci_dev and expect the device to exist > (e.g., waiting after a reset), would do this: > > pci_bus_read_dev_vendor_id(bus, devfn, &l, 60000, dev->is_virtfn) > > And we would skip the 0xffffffff check for VFs, e.g., > > bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *l, > int crs_timeout, int is_vf) > { > if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l)) > return false; > > if (!is_vf && > (*l == 0xffffffff || *l == 0x00000000 || > *l == 0x0000ffff || *l == 0xffff0000)) > return false; > > while ((*l & 0xffff) == 0x0001) { > ... > if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l)) > return false; > } > > return true; > } > > Would that work? > > I don't know if this would be the best solution. This is a messy > area. We're relying on the host bridge to fabricate the 0xffff data > for non-existent devices, and most (maybe all) do that, but I don't > think that's actually in the spec. AFAIK, returning 0xFFFFFFFF is a very typical practice. I don't know the spec reference either. > >> + pci_read_config_dword(dev, PCI_COMMAND, &id); >> + if (id != ~0) >> + return; >> >> do { >> msleep(100); >> pci_read_config_dword(dev, PCI_COMMAND, &id); >> - } while (i++ < 10 && id == ~0); >> + } while (i++ < 9 && id == ~0); >> >> if (id == ~0) >> dev_warn(&dev->dev, "Failed to return from FLR\n"); >> else if (i > 1) >> dev_info(&dev->dev, "Required additional %dms to return from FLR\n", >> - (i - 1) * 100); >> + i * 100); >> } >> >> /** > > [1] http://lkml.kernel.org/r/20170221135138.791ba4e2@xxxxxxxxxx > -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.