On Tue, Dec 13, 2016 at 06:18:40PM -0500, Keith Busch wrote:
> On Tue, Dec 13, 2016 at 02:50:12PM -0600, Bjorn Helgaas wrote:
> > On Mon, Dec 12, 2016 at 07:55:47PM -0500, Keith Busch wrote:
> > > On Mon, Dec 12, 2016 at 05:42:27PM -0600, Bjorn Helgaas wrote:
> > > > On Thu, Dec 08, 2016 at 02:32:53PM -0500, Keith Busch wrote:
> > > > > Depending on the device and the driver, there are hundreds
> > > > > to thousands of non-posted transactions submitted to the
> > > > > device to complete driver unbinding and removal. Since the
> > > > > device is gone, hardware has to handle that as an error
> > > > > condition, which is slower than a successful non-posted
> > > > > transaction. Since we're doing 1000 of them for no
> > > > > particular reason, it takes a long time. If you hot remove a
> > > > > switch with multiple downstream devices, the serialized
> > > > > removal adds up to many seconds.
> > > >
> > > > Another thread mentioned 1-2us as a reasonable config access
> > > > cost, and I'm still a little puzzled about how we get to
> > > > something on the order of a million times that cost.
> > > >
> > > > I know this is all pretty hand-wavey, but 1000 config accesses
> > > > to shut down a device seems unreasonably high. The entire
> > > > config space is only 4096 bytes, and most devices use a small
> > > > fraction of that. If we're really doing 1000 accesses, it
> > > > sounds like we're doing something wrong, like polling without
> > > > a delay or something.
> > >
> > > Every time pci_find_ext_capability is called on a removed
> > > device, the kernel will do 481 failed config space accesses
> > > trying to find that capability. The kernel used to do that
> > > multiple times to find the AER capability under conditions
> > > common to surprise removal.
> >
> > Right, that's a perfect example. I'd rather fix issues like this
> > by caching the position as you did with AER. The "removed" bit
> > makes these issues "go away" without addressing the underlying
> > problem.
> >
> > We might still need a "removed" bit for other reasons, but I want
> > to be clear about those reasons, not just throw it in under the
> > general "make it go fast" umbrella.
> >
> > > But now that we cache the AER position (commit: 66b80809), we've
> > > eliminated by far the worst offender. The counts I'm telling you
> > > are still referencing the original captured traces showing long
> > > tear down times, so it's not up-to-date with the most recent
> > > version of the kernel.
> > >
> > > > I measured the cost of config reads during enumeration using
> > > > the TSC on a 2.8GHz CPU and found the following:
> > > >
> > > >   1580 cycles, 0.565 usec (device present)
> > > >   1230 cycles, 0.440 usec (empty slot)
> > > >   2130 cycles, 0.761 usec (unimplemented function of multi-function device)
> > > >
> > > > So 1-2usec does seem the right order of magnitude, and my
> > > > "empty slot" error responses are actually *faster* than the
> > > > "device present" ones, which is plausible to me because the
> > > > Downstream Port can generate the error response immediately
> > > > without sending a packet down the link. The "unimplemented
> > > > function" responses take longer than the "empty slot", which
> > > > makes sense because the Downstream Port does have to send a
> > > > packet to the device, which then complains because it doesn't
> > > > implement that function.
> > > >
> > > > Of course, these aren't the same case as yours, where the link
> > > > used to be up but is no longer.
> > > > Is there some hardware timeout to see if the link will come back?
> > >
> > > Yes, the hardware does not respond immediately under this test,
> > > which is considered an error condition. This is a reason why
> > > PCIe Device Capabilities 2 Completion Timeout Ranges are
> > > recommended to be in the 10ms range.
> >
> > And we're apparently still doing a lot of these accesses? I'm
> > still curious about exactly what these are, because it may be that
> > we're doing more than necessary.
>
> It's the MSI-X masking that's our next highest contributor. Masking
> vectors still requires non-posted commands, and since they're not
> going through managed API accessors like config space uses, the
> removed flag is needed for checking before doing significant MMIO.

Sorry for digging up this ancient history, but do you remember where
this MSI-X masking with non-posted commands happens? The masking is an
MMIO write (posted), and there should only be a non-posted MMIO read if
we use the msi_set_mask_bit() path.

I'm looking at this path:

  nvme_pci_disable
    pci_free_irq_vectors
      pci_disable_msix
        pci_msix_shutdown
          if (pci_dev_is_disconnected())    # added by 0170591bb067
            return                          # (skip hw access)
          for_each_pci_msi_entry(...)       # <-- loop
            __pci_msix_desc_mask_irq
              writel                        # <-- MMIO write
          pci_read_config_word(PCI_MSIX_FLAGS)
          pci_write_config_word(PCI_MSIX_FLAGS)
        free_msi_irqs

which only does MMIO *writes*; I think those are posted, require no
completion, and should not cause timeouts. That's wasted effort, I
agree, but it doesn't seem like it should be a performance issue.

So there must be another path, probably preceding this one (since
pci_disable_msix() cleans everything up), that masks the vectors and
does the non-posted reads?

The only places we do MMIO reads are msix_program_entries(), which is
done at MSI-X enable time, and msi_set_mask_bit(), which is used in the
irq_chip.irq_mask() and irq_chip.irq_unmask() methods. But I haven't
figured out where that irq_chip path is used in a loop.

Bjorn
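
P.S. To make the 481-access figure above concrete, here's a standalone
toy sketch (my own model, not the kernel's pci_find_ext_capability()
implementation) of why a removed device costs exactly 481 reads per
failed extended capability search: every read completes as all 1s, so
the "next capability" pointer is always 0xffc, and the walk only gives
up when its (4096 - 256) / 8 = 480 iteration budget runs out, on top of
the initial read at 0x100:

  #include <stdio.h>
  #include <stdint.h>

  #define CFG_SPACE_SIZE     256    /* extended capabilities start at 0x100 */
  #define CFG_SPACE_EXP_SIZE 4096   /* end of extended config space */

  static int nreads;

  /* Removed device: every config read completes as all 1s. */
  static uint32_t cfg_read32(int pos)
  {
      (void)pos;
      nreads++;
      return 0xffffffff;
  }

  /* Toy extended capability search: the header has the capability ID
   * in bits 15:0 and the next pointer in bits 31:20. */
  static int find_ext_cap(uint16_t cap)
  {
      int pos = CFG_SPACE_SIZE;
      int ttl = (CFG_SPACE_EXP_SIZE - CFG_SPACE_SIZE) / 8;   /* 480 */
      uint32_t header = cfg_read32(pos);

      if (header == 0)
          return 0;

      while (ttl-- > 0) {
          if ((header & 0xffff) == cap)
              return pos;
          pos = (header >> 20) & 0xffc;    /* all 1s -> always 0xffc */
          if (pos < CFG_SPACE_SIZE)
              break;
          header = cfg_read32(pos);
      }
      return 0;
  }

  int main(void)
  {
      find_ext_cap(0x0001);    /* AER is extended capability ID 0x0001 */
      printf("%d config reads for one failed capability search\n", nreads);
      return 0;
  }

That prints 481, and caching the AER position (commit 66b80809) avoids
repeating the whole walk, which is why it eliminated the worst offender.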
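
And to put Keith's completion-timeout point in perspective: a posted
writel() returns immediately, but, per his observation above, a
non-posted access to a device whose link went down doesn't complete
until the Completion Timeout expires. A back-of-the-envelope sketch,
assuming each such access eats the full 10ms quoted above (the
32-vector count is just an example, not a measurement, and I'm not
claiming any current path does a flush read per vector; that's exactly
what I'm trying to pin down):

  #include <stdio.h>

  int main(void)
  {
      double timeout_ms = 10.0;   /* Completion Timeout in the 10ms range */
      int cap_walk_reads = 481;   /* one failed pci_find_ext_capability() */
      int vectors = 32;           /* hypothetical per-vector flush reads */

      printf("failed capability walk: %3d accesses -> %6.1f ms\n",
             cap_walk_reads, cap_walk_reads * timeout_ms);
      printf("per-vector flush reads: %3d accesses -> %6.1f ms\n",
             vectors, vectors * timeout_ms);
      return 0;
  }

At that rate a single failed capability walk is already ~4.8 seconds,
which lines up with the "many seconds" Keith described, so I'd still
like to pin down exactly which path issues the remaining non-posted
accesses rather than only short-circuiting them with the removed flag.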