On Mon, Feb 24, 2020 at 05:52:14PM +0000, Spassov, Stanislav wrote:
> On Mon, 2020-02-24 at 08:15 -0600, Bjorn Helgaas wrote:
> > On Sun, Feb 23, 2020 at 01:20:55PM +0100, Stanislav Spassov wrote:
> > > From: Wei Wang <wawei@xxxxxxxxx>
> > >
> > > The reasonable value for the maximum time to wait for a PCI
> > > device to be ready after reset varies depending on the platform
> > > and the reliability of its set of devices.
> >
> > There are several mechanisms in the spec for reducing these times,
> > e.g., Readiness Notifications (PCIe r5.0, sec 6.23), the Readiness
> > Time Reporting capability (sec 7.9.17), and ACPI _DSM methods (PCI
> > Firmware Spec r3.2, sec 4.6.8, 4.6.9).
> >
> > I would much rather use standard mechanisms like those instead of
> > adding kernel parameters. A user would have to use trial and
> > error to figure out the value to use with a parameter like this,
> > and that doesn't feel like a robust approach.
>
> I agree that supporting the standard mechanisms is highly desirable,
> but some sort of fallback poll timeout value is necessary on
> platforms where none of those mechanisms are supported. Arguably,
> some kernel configurability (whether via a kernel parameter, as
> proposed here, or some other means) is still desirable.

IIUC we basically have two issues: 1) the default 60 second timeout is
too long, and 2) you'd like to reduce the delays further à la the
Device Readiness _DSM even for firmware that doesn't support that.

The 60 second timeout came from 821cdad5c46c ("PCI: Wait up to 60
seconds for device to become ready after FLR") and is probably too
long. We probably should pick a smaller value based on numbers from
the spec and make quirks for devices that needed more time.

As far as reducing delays for specific devices, if we can identify
them via Vendor/Device ID, can we make per-device values that could be
set either by the _DSM or by a quirk?

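To make the per-device idea concrete, here is a rough userspace sketch, not kernel code. It assumes a firmware-reported value, when present, takes precedence over a quirk table keyed by Vendor/Device ID, with a smaller global default as the last resort. Every name, ID, and number in it (reset_ready_timeout_ms, DEFAULT_RESET_READY_MS, the table entries) is hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical global fallback in milliseconds; the value is illustrative. */
#define DEFAULT_RESET_READY_MS 1000u

/* Hypothetical quirk entry: a device known to need a different wait. */
struct reset_ready_quirk {
	uint16_t vendor;
	uint16_t device;
	unsigned int timeout_ms;
};

/* Made-up Vendor/Device IDs and timeouts, for illustration only. */
static const struct reset_ready_quirk reset_ready_quirks[] = {
	{ 0x1234, 0x0001, 5000 },	/* slow device: needs more time */
	{ 0x1234, 0x0002, 250 },	/* fast device: can wait less */
};

/*
 * Pick the post-reset wait for one device.  dsm_timeout_ms stands in for
 * a value reported by firmware; 0 means nothing was reported.  The order
 * (firmware value, then quirk, then default) is an assumption.
 */
unsigned int reset_ready_timeout_ms(uint16_t vendor, uint16_t device,
				    unsigned int dsm_timeout_ms)
{
	size_t n = sizeof(reset_ready_quirks) / sizeof(reset_ready_quirks[0]);
	size_t i;

	if (dsm_timeout_ms)
		return dsm_timeout_ms;

	for (i = 0; i < n; i++) {
		if (reset_ready_quirks[i].vendor == vendor &&
		    reset_ready_quirks[i].device == device)
			return reset_ready_quirks[i].timeout_ms;
	}

	return DEFAULT_RESET_READY_MS;
}
```

With something shaped like this, a device that needs longer than the default gets a table entry instead of a raised global timeout, and nothing has to be exposed as a kernel parameter.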
I'm trying to wriggle out of adding yet more PCI kernel parameters
because people tend to stumble across them and pass them around on
bulletin boards as ways to "fix" or "speed up" things that really
should be addressed in better ways.

> I also agree there is no robust method to determine a "good value", but
> then again - how was the current value for the constant determined? As
> suggested in PATCH 2, the idea is to lower the global timeout to avoid
> hung tasks when devices break and never come back after reset.

I don't remember exactly how we came up with 60 seconds; I suspect it
was just a convenient large number.

Bjorn