On Mon, 2021-09-13 at 11:38 -0500, Bjorn Helgaas wrote:
> On Mon, Sep 13, 2021 at 04:29:51PM +0000, Spassov, Stanislav wrote:
> > On Sat, 2021-09-11 at 09:03 -0500, Bjorn Helgaas wrote:
> >
> > I later understood the specific CPU did have a proprietary register for
> > "limiting the number of loops" that the PCIe spec talks about, and indeed
> > that register was set to "no limit". Coupled with the stuck device, these
> > indefinite retries eventually triggered TOR timeout.
>
> "No limit" sounds like a pretty bad choice, given that it means the
> CPU will essentially hang forever because of a defective I/O device.
> There should be a timeout so software can recover (the *device* may
> never recover, but that's no reason why the kernel must crash).
>

Correct. "No limit" is definitely a bad choice for that register, and
fixing the value would be preferable to any software solution.

Unfortunately, at least in the case I worked on, that register was not
accessible by the kernel. Intel exposes many CPU configuration registers
in terms of virtual PCI devices residing directly on Root Buses, and the
system/platform firmware is able to use vendor-provided means to
completely hide some of these pseudo-devices from the OS.

Additionally, the way the PCIe spec is phrased, not every Root Complex
implementation is required to even have such a limiting register, while
all implementations that advertise CRS SV capability are required to
behave as prescribed when PCI_VENDOR_ID is read.

Hence why I believe this patch is a general robustness improvement,
rather than a workaround for a specific device/platform.



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
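
For illustration only, and not taken from the patch under discussion: a minimal user-space sketch of the kind of bounded polling that CRS Software Visibility makes possible. With CRS SV enabled, a PCI_VENDOR_ID read that completes with CRS returns the reserved Vendor ID 0x0001 instead of being retried in hardware, so software can apply its own timeout. The names wait_for_vendor_id, read_vendor_fn, fake_dev and fake_read are hypothetical, the backoff and timeout values are arbitrary, and time is only simulated rather than slept.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Vendor ID value returned when a config read completes with CRS (CRS SV on). */
#define CRS_VENDOR_ID 0x0001u

/* Hypothetical accessor: returns the 32-bit dword at config offset 0. */
typedef uint32_t (*read_vendor_fn)(void *ctx);

/*
 * Poll the Vendor ID until the device stops answering with CRS or until
 * timeout_ms of (simulated) waiting has accumulated. Returns true and
 * stores the dword in *id on success, false if there is no device or the
 * device is still returning CRS when the timeout expires.
 */
static bool wait_for_vendor_id(read_vendor_fn read, void *ctx,
                               unsigned timeout_ms, uint32_t *id)
{
    unsigned delay_ms = 1, waited_ms = 0;

    assert(read && id);

    for (;;) {
        uint32_t l = read(ctx);

        if (l == 0xffffffffu)           /* all ones: nothing responded */
            return false;
        if ((l & 0xffffu) != CRS_VENDOR_ID) {
            *id = l;                    /* device is ready */
            return true;
        }
        if (waited_ms >= timeout_ms)    /* stuck device: give up, do not hang */
            return false;

        /* Real code would sleep for delay_ms here; this sketch only counts. */
        waited_ms += delay_ms;
        delay_ms *= 2;                  /* exponential backoff */
    }
}

/* Hypothetical device model: answers with CRS crs_left times, then with id. */
struct fake_dev {
    unsigned crs_left;
    uint32_t id;
};

static uint32_t fake_read(void *ctx)
{
    struct fake_dev *d = ctx;

    if (d->crs_left) {
        d->crs_left--;
        return 0xffff0001u;             /* Vendor ID 0x0001, other bytes all ones */
    }
    return d->id;
}
```

The point of the sketch is the design choice argued for above: the retry limit lives in software, so a device that never leaves CRS costs a bounded wait rather than an indefinite stall of the CPU.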