On Mon, Sep 15, 2014 at 10:10:20PM -0700, Rajat Jain wrote:
> Hi Bjorn,
>
> On Mon, Sep 8, 2014 at 10:38 PM, Bjorn Helgaas <bhelgaas@xxxxxxxxxx> wrote:
> > On Tue, Sep 02, 2014 at 04:26:00PM -0700, Rajat Jain wrote:
> >>
> >> As per the PCIe spec, an endpoint may return the configuration cycles
> >> with CRS if it is not yet fully ready to be accessed. If the CRS visibility
> >> is not enabled at the root port, the spec leaves the retry behaviour open
> >> to implementation in such a case. The Intel root ports have chosen to retry
> >> endlessly in this situation. Thus, the root controller may "hang" (repeatedly
> >> retrying the configuration requests until it gets a status other than CRS) if
> >> a device returns CRS for a long time. This can cause a broken endpoint to bring
> >> down the whole PCI hierarchy.
> >>
> >> This was recently known to cause problems on Intel systems and
> >> was discussed here:
> >> http://marc.info/?t=140926298500002&r=1&w=2
> >>
> >> Ref1:
> >> https://www.pcisig.com/specifications/pciexpress/ECN_CRS_Software_Visibility_No27.pdf
> >>
> >> Ref2:
> >> PCIe spec V3.0, pg119, pg127 for "Configuration Request Retry Status"
> >>
> >> Thus enable the CRS visibility for the root ports that support it. This
> >> patch reverts the following commit, but enables CRS visibility only
> >> when the root port supports it:
> >>
> >> ad7edfe04908 ("[PCI] Do not enable CRS Software Visibility by default")
> >>
> >> (Linus' response: http://marc.info/?l=linux-pci&m=140968622520192&w=2)
> >>
> >> Signed-off-by: Rajat Jain <rajatxjain@xxxxxxxxx>
> >> Signed-off-by: Rajat Jain <rajatjain@xxxxxxxxxxx>
> >> Signed-off-by: Guenter Roeck <groeck@xxxxxxxxxxx>
> >
> > I put this and the "only look at Vendor ID" patch on a pci/enumeration
> > branch [1]. I rewrote the changelogs to reflect my understanding of what's
> > going on, so probably the real truth is somewhere between your original and
> > mine. Let me know what should be fixed.
> >
> > We should figure out an easy way for Josh to test these. Ideally, he could
> > test the second patch by itself first, then both together.
>
> OK, Josh and I tested this over the last week on his HW (the HW that
> had originally reported the problem). Somehow his hardware does not
> show the problem in ANY case. I tried the following, and the original
> issue (vendor id = 1) was never seen:
>
> 1) 3.17-rc2 (has CRS disabled)
> 2) 3.17-rc2 + Enable CRS
> 3) 3.17-rc2 + Enable CRS + Ignore Device ID
>
> The Device always returned the correct Vendor ID and Device ID in all
> cases. Thus even enabling CRS does not make his system fail in any way.

Thanks a lot for all the work to dig out the board and test it. I really
appreciate it.

My inclination is to apply both patches. It doesn't seem strictly
necessary to ignore the device ID on this platform, but I don't think we
gain anything by verifying that device ID == 0xffff except confirming spec
compliance.

We *could* put more effort into reproducing the original problem, e.g., by
building v2.6.24-rc1, where this problem was originally reported, and
(hopefully) reproducing it there, then figuring out where it got fixed
along the way. But I'm not sure it's worth the effort.

Bjorn
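
For reference, the difference between checking the full ID dword and
checking only the Vendor ID can be sketched roughly as below. This is an
illustration, not the code from either patch: the function names and the
CRS_VENDOR_ID macro are made up for the example, and it assumes the usual
config-space layout (Vendor ID in the low 16 bits of the first dword,
Device ID in the high 16 bits) with a root port that reports 0x0001 as the
Vendor ID for a CRS completion when CRS Software Visibility is enabled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name: Vendor ID value reported for a CRS completion
 * when CRS Software Visibility is enabled. */
#define CRS_VENDOR_ID 0x0001u

/* Strict check (hypothetical helper): require Vendor ID 0x0001 and
 * Device ID 0xffff, i.e. the whole first config dword must match. */
static bool crs_strict(uint32_t id_dword)
{
    return id_dword == 0xffff0001u;
}

/* Relaxed check (hypothetical helper): look only at the Vendor ID in
 * the low 16 bits and ignore whatever the Device ID half contains. */
static bool crs_vendor_only(uint32_t id_dword)
{
    return (id_dword & 0xffffu) == CRS_VENDOR_ID;
}
```

With these helpers, a value such as 0x12340001 is treated as CRS by the
relaxed check but not by the strict one, which is the only case where
ignoring the device ID changes behaviour.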