On Wed, Aug 23, 2017 at 2:03 AM, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote: > [+cc Sinan, Timur, Alex] > > Hi Oza, > > On Mon, Aug 21, 2017 at 09:28:43PM +0530, Oza Pawandeep wrote: >> PCIe spec r3.1, sec 2.3.2 >> If CRS software visibility is not enabled, the RC must reissue the >> config request as a new request. >> >> - If CRS software visibility is enabled, >> - for a config read of Vendor ID, the RC must return 0x0001 data >> - for all other config reads/writes, the RC must reissue the >> request > > You mentioned early on that this fixes an issue with NVMe after a > reset. Is that reset an FLR? I'm wondering if this overlaps with the > work Sinan is doing: > > http://lkml.kernel.org/r/20170818212310.15145.21732.stgit@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx > > With the current code in the tree, pci_flr_wait() probably waits only > 100ms on your system. It currently reads PCI_COMMAND and probably > sees 0xffff0001 because the NVMe device returns a CRS completion > (which this iproc RC incorrectly turns into 0xffff0001 data), so we > exit the loop after a single msleep(100). > > Sinan is extending pci_flr_wait() to read the Vendor ID and look for > the CRS SV indication. If this is where you're seeing the NVMe issue, > I bet we can fix this by simply changing iproc_pcie_config_read() to > return the correct data when it sees a CRS completion. Then the > retry loop in pci_flr_wait() will work as intended. > The issue with User Space poll mode drivers resulting into CRS. root@bcm958742k:~# /usr/share/spdk/examples/nvme/perf -r 'trtype:PCIe traddr:0000:01:00.0' -q 128 -s 4096 -w read -d 2048 -t 30 -c 0x1 Starting DPDK 16.11.1 initialization... [ DPDK EAL parameters: perf -c 1 -m 2048 --file-prefix=spdk_pid2202 ] EAL: Detected 8 lcore(s) EAL: Probing VFIO support... EAL: VFIO support initialized EAL: cannot open /proc/self/numa_maps, consider that all memory is in socket_id 0 Initializing NVMe Controllers EAL: PCI device 0000:01:00.0 on NUMA socket 0 EAL: probe driver: 8086:953 spdk_nvme EAL: using IOMMU type 1 (Type 1) [ 45.811353] vfio_ecap_init: 0000:01:00.0 hiding ecap 0x19@0x2a0 Attaching to NVMe Controller at 0000:01:00.0 [8086:0953] [ 46.486601] vfio_bar_restore: 0000:01:00.0 reset recovery - restoring bars nvme_ctrlr.c:1237:nvme_ctrlr_process_init: ***ERROR*** Initialization timed out in state 3 nvme_ctrlr.c: 414:nvme_ctrlr_shutdown: ***ERROR*** did not shutdown within 5 seconds EAL: Hotplug doesn't support vfio yet spdk_nvme_probe() failed for transport address '0000:01:00.0' /usr/share/spdk/examples/nvme/perf: errors occured ok, so this is in the context of [ 52.500579] [<ffff000008088498>] dump_backtrace+0x0/0x208 [ 52.506147] [<ffff0000080886b4>] show_stack+0x14/0x20 [ 52.511357] [<ffff0000083ace34>] dump_stack+0x8c/0xb0 [ 52.516567] [<ffff00000840a550>] pci_flr_wait+0x18/0xc8 [ 52.521955] [<ffff00000840bf9c>] pcie_flr+0x34/0x68 [ 52.526986] [<ffff00000840e3d0>] __pci_dev_reset+0x100/0x2c0 [ 52.532823] [<ffff00000840e60c>] pci_try_reset_function+0x64/0x80 [ 52.539108] [<ffff00000856e4b8>] vfio_pci_ioctl+0x398/0xb78 [ 52.544857] [<ffff000008568378>] vfio_device_fops_unl_ioctl+0x20/0x30 [ 52.551501] [<ffff0000081d4e24>] do_vfs_ioctl+0xa4/0x738 [ 52.556980] [<ffff0000081d5544>] SyS_ioctl+0x8c/0xa0 [ 52.562100] [<ffff000008082f8c>] __sys_trace_return+0x0/0x4 So, I have two things to do here. 1) remove retry funciton and just do following. data = readl(cfg_data_p); if (data == CFG_RETRY_STATUS) /* 0xffff0001 */ data = 0xffffffff; 2) refactor the code with separate patch. but for that Sinan's patch is required. then with both the patches I think, things will work out correctly for iproc SOC. Let me know how this sounds and if you agree with above two points, I will be posting final set of patches. let me know if you want me to change anything in the commit description. in my opinion I should remove config write description which is irrelevant here. Regards, Oza. > Sec 2.3.2 says the RC may limit the number of retries, and it doesn't > say anything about how many retries the RC should do or what interval > should be between them. So I think it should be legal to say "this RC > does zero retries". > > I think an RC that does zero retries and doesn't support CRS SV would > always handle a CRS completion as a "failed transaction" so it would > synthesize 0xffffffff data (and, I assume, set an error bit like > Received Master Abort). If we treated this controller as though it > doesn't support CRS SV, the workaround in iproc_pcie_config_read() > would be simply: > > data = readl(cfg_data_p); > if (data == CFG_RETRY_STATUS) /* 0xffff0001 */ > data = 0xffffffff; > > That would make the loop in pci_flr_wait() that reads PCI_COMMAND work > correctly. It would get 0xffffffff data as long as the device > returned CRS completions. > > It wouldn't make Sinan's new CRS SV code work, but that's OK: that all > has to work correctly even on systems that don't support CRS SV. > > The only problem is that when we read a non-Vendor ID register that > happens to really be 0xffff0001, we *should* return 0xffff0001, but we > incorrectly return 0xffffffff. But we already know we just have to > live with that issue. > > One minor refactoring comment below. > >> iproc PCIe Controller spec: >> 4.7.3.3. Retry Status On Configuration Cycle >> Endpoints are allowed to generate retry status on configuration >> cycles. In this case, the RC needs to re-issue the request. The IP >> does not handle this because the number of configuration cycles needed >> will probably be less than the total number of non-posted operations >> needed. >> >> When a retry status is received on the User RX interface for a >> configuration request that was sent on the User TX interface, >> it will be indicated with a completion with the CMPL_STATUS field set >> to 2=CRS, and the user will have to find the address and data values >> and send a new transaction on the User TX interface. >> When the internal configuration space returns a retry status during a >> configuration cycle (user_cscfg = 1) on the Command/Status interface, >> the pcie_cscrs will assert with the pcie_csack signal to indicate the >> CRS status. >> When the CRS Software Visibility Enable register in the Root Control >> register is enabled, the IP will return the data value to 0x0001 for >> the Vendor ID value and 0xffff (all 1’s) for the rest of the data in >> the request for reads of offset 0 that return with CRS status. This >> is true for both the User RX Interface and for the Command/Status >> interface. When CRS Software Visibility is enabled, the CMPL_STATUS >> field of the completion on the User RX Interface will not be 2=CRS and >> the pcie_cscrs signal will not assert on the Command/Status interface. >> >> Per PCIe r3.1, sec 2.3.2, config requests that receive completions >> with Configuration Request Retry Status (CRS) should be reissued by >> the hardware except reads of the Vendor ID when CRS Software >> Visibility is enabled. >> >> This hardware never reissues configuration requests when it receives >> CRS completions. >> Note that, neither PCIe host bridge nor PCIe core re-issues the >> request for any configuration offset. >> As a result of the fact, PCIe RC driver (sw) should take care of >> retrying. >> >> For config writes, there is no way for this driver to learn about >> CRS completions. A write that receives a CRS completion will not be >> retried and the data will be discarded. This is a potential data >> corruption. >> >> For config reads, this hardware returns CFG_RETRY_STATUS data when >> it receives a CRS completion for a config read, regardless of the >> address of the read or the CRS Software Visibility Enable bit. As a >> partial workaround for this, we retry in software any read that >> returns CFG_RETRY_STATUS. >> >> It implements iproc_pcie_config_read which gets called for Stingray. >> For other iproc based SOC, it falls back to PCI generic APIs. >> >> Signed-off-by: Oza Pawandeep <oza.oza@xxxxxxxxxxxx> >> Reviewed-by: Ray Jui <ray.jui@xxxxxxxxxxxx> >> Reviewed-by: Scott Branden <scott.branden@xxxxxxxxxxxx> >> >> diff --git a/drivers/pci/host/pcie-iproc.c b/drivers/pci/host/pcie-iproc.c >> index c574863..a3edae4 100644 >> --- a/drivers/pci/host/pcie-iproc.c >> +++ b/drivers/pci/host/pcie-iproc.c >> @@ -68,6 +68,9 @@ >> #define APB_ERR_EN_SHIFT 0 >> #define APB_ERR_EN BIT(APB_ERR_EN_SHIFT) >> >> +#define CFG_RETRY_STATUS 0xffff0001 >> +#define CFG_RETRY_STATUS_TIMEOUT_US 500000 /* 500 milli-seconds. */ >> + >> /* derive the enum index of the outbound/inbound mapping registers */ >> #define MAP_REG(base_reg, index) ((base_reg) + (index) * 2) >> >> @@ -448,6 +451,89 @@ static inline void iproc_pcie_apb_err_disable(struct pci_bus *bus, >> } >> } >> >> +static unsigned int iproc_pcie_cfg_retry(void __iomem *cfg_data_p) >> +{ >> + int timeout = CFG_RETRY_STATUS_TIMEOUT_US; >> + unsigned int data; >> + >> + /* >> + * As per PCIe spec r3.1, sec 2.3.2, CRS Software >> + * Visibility only affects config read of the Vendor ID. >> + * For config write or any other config read the Root must >> + * automatically re-issue configuration request again as a >> + * new request. >> + * >> + * For config reads, this hardware returns CFG_RETRY_STATUS data when >> + * it receives a CRS completion for a config read, regardless of the >> + * address of the read or the CRS Software Visibility Enable bit. As a >> + * partial workaround for this, we retry in software any read that >> + * returns CFG_RETRY_STATUS. >> + */ >> + data = readl(cfg_data_p); >> + while (data == CFG_RETRY_STATUS && timeout--) { >> + udelay(1); >> + data = readl(cfg_data_p); >> + } >> + >> + if (data == CFG_RETRY_STATUS) >> + data = 0xffffffff; >> + >> + return data; >> +} >> + >> +static void __iomem *iproc_pcie_map_ep_cfg_reg(struct iproc_pcie *pcie, >> + unsigned int busno, >> + unsigned int slot, >> + unsigned int fn, >> + int where) >> +{ >> + u16 offset; >> + u32 val; >> + >> + /* EP device access */ >> + val = (busno << CFG_ADDR_BUS_NUM_SHIFT) | >> + (slot << CFG_ADDR_DEV_NUM_SHIFT) | >> + (fn << CFG_ADDR_FUNC_NUM_SHIFT) | >> + (where & CFG_ADDR_REG_NUM_MASK) | >> + (1 & CFG_ADDR_CFG_TYPE_MASK); >> + >> + iproc_pcie_write_reg(pcie, IPROC_PCIE_CFG_ADDR, val); >> + offset = iproc_pcie_reg_offset(pcie, IPROC_PCIE_CFG_DATA); >> + >> + if (iproc_pcie_reg_is_invalid(offset)) >> + return NULL; >> + >> + return (pcie->base + offset); >> +} >> + >> +static int iproc_pcie_config_read(struct pci_bus *bus, unsigned int devfn, >> + int where, int size, u32 *val) >> +{ >> + struct iproc_pcie *pcie = iproc_data(bus); >> + unsigned int slot = PCI_SLOT(devfn); >> + unsigned int fn = PCI_FUNC(devfn); >> + unsigned int busno = bus->number; >> + void __iomem *cfg_data_p; >> + u32 data; >> + >> + /* root complex access. */ >> + if (busno == 0) >> + return pci_generic_config_read32(bus, devfn, where, size, val); >> + >> + cfg_data_p = iproc_pcie_map_ep_cfg_reg(pcie, busno, slot, fn, where); >> + >> + if (!cfg_data_p) >> + return PCIBIOS_DEVICE_NOT_FOUND; >> + >> + data = iproc_pcie_cfg_retry(cfg_data_p); >> + >> + *val = data; >> + if (size <= 2) >> + *val = (data >> (8 * (where & 3))) & ((1 << (size * 8)) - 1); >> + >> + return PCIBIOS_SUCCESSFUL; >> +} >> + >> /** >> * Note access to the configuration registers are protected at the higher layer >> * by 'pci_lock' in drivers/pci/access.c >> @@ -459,7 +545,6 @@ static void __iomem *iproc_pcie_map_cfg_bus(struct iproc_pcie *pcie, >> { >> unsigned slot = PCI_SLOT(devfn); >> unsigned fn = PCI_FUNC(devfn); >> - u32 val; >> u16 offset; >> >> /* root complex access */ >> @@ -484,18 +569,7 @@ static void __iomem *iproc_pcie_map_cfg_bus(struct iproc_pcie *pcie, >> if (slot > 0) >> return NULL; >> >> - /* EP device access */ >> - val = (busno << CFG_ADDR_BUS_NUM_SHIFT) | >> - (slot << CFG_ADDR_DEV_NUM_SHIFT) | >> - (fn << CFG_ADDR_FUNC_NUM_SHIFT) | >> - (where & CFG_ADDR_REG_NUM_MASK) | >> - (1 & CFG_ADDR_CFG_TYPE_MASK); >> - iproc_pcie_write_reg(pcie, IPROC_PCIE_CFG_ADDR, val); >> - offset = iproc_pcie_reg_offset(pcie, IPROC_PCIE_CFG_DATA); >> - if (iproc_pcie_reg_is_invalid(offset)) >> - return NULL; >> - else >> - return (pcie->base + offset); >> + return iproc_pcie_map_ep_cfg_reg(pcie, busno, slot, fn, where); > > Thanks for factoring this out. Can you please split that refactor > into a separate patch? That will make this CRS-handling patch smaller > and simpler. > >> } >> >> static void __iomem *iproc_pcie_bus_map_cfg_bus(struct pci_bus *bus, >> @@ -554,8 +628,13 @@ static int iproc_pcie_config_read32(struct pci_bus *bus, unsigned int devfn, >> int where, int size, u32 *val) >> { >> int ret; >> + struct iproc_pcie *pcie = iproc_data(bus); >> >> iproc_pcie_apb_err_disable(bus, true); >> + if (pcie->type == IPROC_PCIE_PAXB_V2) >> + ret = iproc_pcie_config_read(bus, devfn, where, size, val); >> + else >> + ret = pci_generic_config_read32(bus, devfn, where, size, val); >> ret = pci_generic_config_read32(bus, devfn, where, size, val); >> iproc_pcie_apb_err_disable(bus, false); >> >> -- >> 1.9.1 >>