On Mon, Sep 13, 2021 at 11:04 AM Spassov, Stanislav <stanspas@xxxxxxxxx> wrote:
>
> On Mon, 2021-09-13 at 11:38 -0500, Bjorn Helgaas wrote:
> > On Mon, Sep 13, 2021 at 04:29:51PM +0000, Spassov, Stanislav wrote:
> > > On Sat, 2021-09-11 at 09:03 -0500, Bjorn Helgaas wrote:
> > >
> > > I later understood the specific CPU did have a proprietary register for
> > > "limiting the number of loops" that the PCIe spec talks about, and indeed
> > > that register was set to "no limit". Coupled with the stuck device, these
> > > indefinite retries eventually triggered TOR timeout.
> >
> > "No limit" sounds like a pretty bad choice, given that it means the
> > CPU will essentially hang forever because of a defective I/O device.
> > There should be a timeout so software can recover (the *device* may
> > never recover, but that's no reason why the kernel must crash).
> >
>
> Correct. "No limit" is definitely a bad choice for that register,
> and fixing the value would be preferable to any software solution.
>
> Unfortunately, at least in the case I worked on, that register was
> not accessible by the kernel.

I can confirm that I have come across exactly the same issue (no limit
on retries, resulting in a CPU hang) on another old Intel root port in
the past:

https://lore.kernel.org/linux-pci/53FFA54D.9000907@xxxxxxxxx/
https://lkml.org/lkml/2014/8/1/186

and had the same problem (no way to limit the number of retries).

I'd be interested in the next patch Stanislav sends out, and will keep
a lookout for it!

Thanks!

Rajat

> Intel exposes many CPU configuration
> registers in terms of virtual PCI devices residing directly on Root
> Buses, and the system/platform firmware is able to use vendor-provided
> means to completely hide some of these pseudo-devices from the OS.
>
> Additionally, the way the PCIe spec is phrased, not every Root Complex
> implementation is required to even have such a limiting register, while
> all implementations that advertise CRS SV capability are required to
> behave as prescribed when PCI_VENDOR_ID is read. Hence why I believe
> this patch is a general robustness improvement, rather than a workaround
> for a specific device/platform.