On Tue, Dec 13, 2016 at 02:50:12PM -0600, Bjorn Helgaas wrote:
> On Mon, Dec 12, 2016 at 07:55:47PM -0500, Keith Busch wrote:
> > On Mon, Dec 12, 2016 at 05:42:27PM -0600, Bjorn Helgaas wrote:
> > > On Thu, Dec 08, 2016 at 02:32:53PM -0500, Keith Busch wrote:
> > > > Depending on the device and the driver, there are hundreds to
> > > > thousands of non-posted transactions submitted to the device to
> > > > complete driver unbinding and removal. Since the device is gone,
> > > > hardware has to handle that as an error condition, which is slower
> > > > than a successful non-posted transaction. Since we're doing 1000 of
> > > > them for no particular reason, it takes a long time. If you hot
> > > > remove a switch with multiple downstream devices, the serialized
> > > > removal adds up to many seconds.
> > >
> > > Another thread mentioned 1-2us as a reasonable config access cost, and
> > > I'm still a little puzzled about how we get to something on the order
> > > of a million times that cost.
> > >
> > > I know this is all pretty hand-wavey, but 1000 config accesses to shut
> > > down a device seems unreasonably high. The entire config space is
> > > only 4096 bytes, and most devices use a small fraction of that. If
> > > we're really doing 1000 accesses, it sounds like we're doing something
> > > wrong, like polling without a delay or something.
> >
> > Every time pci_find_ext_capability is called on a removed device, the
> > kernel will do 481 failed config space accesses trying to find that
> > capability. The kernel used to do that multiple times to find the AER
> > capability under conditions common to surprise removal.
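
To show where the 481 number comes from, here is roughly what the
search loop in pci_find_ext_capability() does (a simplified sketch, not
the kernel code verbatim). On a removed device every config read
completes with all 1's data, so PCI_EXT_CAP_NEXT() keeps returning
0xffc and the walk only ends when its iteration budget runs out: one
initial read plus (4096 - 256) / 8 = 480 more, 481 failed non-posted
reads per call.

#include <linux/pci.h>

static int find_ext_cap_sketch(struct pci_dev *dev, int cap)
{
	int ttl = (PCI_CFG_SPACE_EXP_SIZE - PCI_CFG_SPACE_SIZE) / 8; /* 480 */
	int pos = PCI_CFG_SPACE_SIZE;	/* extended space starts at 0x100 */
	u32 header;

	if (pci_read_config_dword(dev, pos, &header) != PCIBIOS_SUCCESSFUL)
		return 0;

	while (ttl-- > 0) {
		/* all 1's data gives cap ID 0xffff, so this never matches */
		if (PCI_EXT_CAP_ID(header) == cap)
			return pos;

		/* all 1's data makes this 0xffc, so the walk never terminates */
		pos = PCI_EXT_CAP_NEXT(header);
		if (pos < PCI_CFG_SPACE_SIZE)
			break;

		/* one more failed non-posted read per iteration */
		if (pci_read_config_dword(dev, pos, &header) != PCIBIOS_SUCCESSFUL)
			break;
	}

	return 0;
}
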
>
> Right, that's a perfect example. I'd rather fix issues like this by
> caching the position as you did with AER. The "removed" bit makes
> these issues "go away" without addressing the underlying problem.
>
> We might still need a "removed" bit for other reasons, but I want to
> be clear about those reasons, not just throw it in under the general
> "make it go fast" umbrella.
>
> > But now that we cache the AER position (commit 66b80809), we've
> > eliminated by far the worst offender. The counts I'm telling you still
> > reference the original captured traces showing long teardown times, so
> > they're not up to date with the most recent version of the kernel.
> >
> > > I measured the cost of config reads during enumeration using the TSC
> > > on a 2.8GHz CPU and found the following:
> > >
> > >   1580 cycles, 0.565 usec (device present)
> > >   1230 cycles, 0.440 usec (empty slot)
> > >   2130 cycles, 0.761 usec (unimplemented function of multi-function device)
> > >
> > > So 1-2usec does seem the right order of magnitude, and my "empty slot"
> > > error responses are actually *faster* than the "device present" ones,
> > > which is plausible to me because the Downstream Port can generate the
> > > error response immediately without sending a packet down the link.
> > > The "unimplemented function" responses take longer than the "empty
> > > slot" ones, which makes sense because the Downstream Port does have
> > > to send a packet to the device, which then complains because it
> > > doesn't implement that function.
> > >
> > > Of course, these aren't the same case as yours, where the link used to
> > > be up but is no longer. Is there some hardware timeout to see if the
> > > link will come back?
> >
> > Yes, the hardware does not respond immediately under this test, which
> > is considered an error condition. This is a reason why the PCIe Device
> > Capabilities 2 Completion Timeout Ranges are recommended to be in the
> > 10ms range.
>
> And we're apparently still doing a lot of these accesses? I'm still
> curious about exactly what these are, because it may be that we're
> doing more than necessary.

It's the MSI-X masking that's our next highest contributor. Masking
vectors still requires non-posted commands, and since those accesses
don't go through managed accessors the way config space accesses do,
the removed flag is needed so we can check before doing significant
MMIO. A sketch of what I mean follows.
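
Something like this (a sketch only; pci_dev_is_disconnected() stands in
for whatever form the removed-flag test ends up taking, and the call
site is illustrative, not the final patch):

#include <linux/io.h>
#include <linux/pci.h>

/* Per-entry mask bit in the MSI-X Vector Control word (drivers/pci/msi.h) */
#define PCI_MSIX_ENTRY_CTRL_MASKBIT	1

/* Stand-in for the removed-flag test this series adds (name assumed) */
static bool pci_dev_is_disconnected(struct pci_dev *dev)
{
	return dev->error_state == pci_channel_io_perm_failure;
}

static void msix_mask_vector_sketch(struct pci_dev *dev,
				    void __iomem *vector_ctrl)
{
	u32 ctrl;

	/*
	 * Device is gone: the read below would stall until the completion
	 * timeout fires and then return all 1's, so don't issue it at all.
	 */
	if (pci_dev_is_disconnected(dev))
		return;

	ctrl = readl(vector_ctrl);	/* the non-posted access */
	writel(ctrl | PCI_MSIX_ENTRY_CTRL_MASKBIT, vector_ctrl);
}

The point is that the check has to live at the call site, before the
readl(), because readl()/writel() have no managed wrapper underneath
them the way the config space accessors do.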