On Tue, Dec 13, 2016 at 06:18:40PM -0500, Keith Busch wrote:
> On Tue, Dec 13, 2016 at 02:50:12PM -0600, Bjorn Helgaas wrote:
> > On Mon, Dec 12, 2016 at 07:55:47PM -0500, Keith Busch wrote:
> > > On Mon, Dec 12, 2016 at 05:42:27PM -0600, Bjorn Helgaas wrote:
> > > > On Thu, Dec 08, 2016 at 02:32:53PM -0500, Keith Busch wrote:
> > > > > Depending on the device and the driver, there are hundreds
> > > > > to thousands of non-posted transactions submitted to the
> > > > > device to complete driver unbinding and removal. Since the
> > > > > device is gone, hardware has to handle that as an error
> > > > > condition, which is slower than a successful non-posted
> > > > > transaction. Since we're doing 1000 of them for no
> > > > > particular reason, it takes a long time. If you hot remove a
> > > > > switch with multiple downstream devices, the serialized
> > > > > removal adds up to many seconds.
> > > >
> > > > Another thread mentioned 1-2us as a reasonable config access
> > > > cost, and I'm still a little puzzled about how we get to
> > > > something on the order of a million times that cost.
> > > >
> > > > I know this is all pretty hand-wavey, but 1000 config accesses
> > > > to shut down a device seems unreasonably high. The entire
> > > > config space is only 4096 bytes, and most devices use a small
> > > > fraction of that. If we're really doing 1000 accesses, it
> > > > sounds like we're doing something wrong, like polling without
> > > > a delay or something.
> > >
> > > Every time pci_find_ext_capability is called on a removed
> > > device, the kernel will do 481 failed config space accesses
> > > trying to find that capability. The kernel used to do that
> > > multiple times to find the AER capability under conditions
> > > common to surprise removal.
> >
> > Right, that's a perfect example. I'd rather fix issues like this
> > by caching the position as you did with AER. The "removed" bit
> > makes these issues "go away" without addressing the underlying
> > problem.
> >
> > We might still need a "removed" bit for other reasons, but I want
> > to be clear about those reasons, not just throw it in under the
> > general "make it go fast" umbrella.
> >
> > > But now that we cache the AER position (commit: 66b80809), we've
> > > eliminated by far the worst offender. The counts I'm telling you
> > > are still referencing the original captured traces showing long
> > > tear down times, so it's not up-to-date with the most recent
> > > version of the kernel.
> > >
> > > > I measured the cost of config reads during enumeration using
> > > > the TSC on a 2.8GHz CPU and found the following:
> > > >
> > > >   1580 cycles, 0.565 usec (device present)
> > > >   1230 cycles, 0.440 usec (empty slot)
> > > >   2130 cycles, 0.761 usec (unimplemented function of multi-function device)
> > > >
> > > > So 1-2usec does seem the right order of magnitude, and my
> > > > "empty slot" error responses are actually *faster* than the
> > > > "device present" ones, which is plausible to me because the
> > > > Downstream Port can generate the error response immediately
> > > > without sending a packet down the link. The "unimplemented
> > > > function" responses take longer than the "empty slot", which
> > > > makes sense because the Downstream Port does have to send a
> > > > packet to the device, which then complains because it doesn't
> > > > implement that function.
> > > >
> > > > Of course, these aren't the same case as yours, where the link
> > > > used to be up but is no longer.
> > > > Is there some hardware timeout to see if the link will come back?
> > >
> > > Yes, the hardware does not respond immediately under this test,
> > > which is considered an error condition. This is a reason why
> > > PCIe Device Capabilities 2 Completion Timeout Ranges are
> > > recommended to be in the 10ms range.
> >
> > And we're apparently still doing a lot of these accesses? I'm
> > still curious about exactly what these are, because it may be that
> > we're doing more than necessary.
>
> It's the MSI-X masking that's our next highest contributor. Masking
> vectors still requires non-posted commands, and since they're not
> going through managed API accessors like config space uses, the
> removed flag is needed for checking before doing significant MMIO.

Sorry for digging up this ancient history, but do you remember where
this MSI-X masking with non-posted commands happens? The masking is an
MMIO write (posted), and there should only be a non-posted MMIO read if
we use the msi_set_mask_bit() path.

I'm looking at this path:

  nvme_pci_disable
    pci_free_irq_vectors
      pci_disable_msix
        pci_msix_shutdown
          if (pci_dev_is_disconnected())    # added by 0170591bb067
            return                          # (skip hw access)
          for_each_pci_msi_entry(...)       # <-- loop
            __pci_msix_desc_mask_irq
              writel                        # <-- MMIO write
          pci_read_config_word(PCI_MSIX_FLAGS)
          pci_write_config_word(PCI_MSIX_FLAGS)
        free_msi_irqs

which only does MMIO *writes*; I think those are posted, require no
completion, and should not cause timeouts. That's wasted effort, I
agree, but it doesn't seem like it should be a performance issue.

So there must be another path, probably preceding this one (since
pci_disable_msix() cleans everything up), that masks the vectors and
does the non-posted reads?

The only places we do MMIO reads are msix_program_entries(), which is
done at MSI-X enable time, and msi_set_mask_bit(), which is used in the
irq_chip.irq_mask() and irq_chip.irq_unmask() methods. But I haven't
figured out where that irq_chip path is used in a loop.

Bjorn
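
P.S. To make the 481-access figure above concrete, here's a standalone
toy sketch (my own model, not the kernel's pci_find_ext_capability()
implementation) of why a removed device costs exactly 481 reads per
failed extended capability search: every read completes as all 1s, so
the "next capability" pointer is always 0xffc, and the walk only gives
up when its (4096 - 256) / 8 = 480 iteration budget runs out, on top of
the initial read at 0x100:

  #include <stdio.h>
  #include <stdint.h>

  #define CFG_SPACE_SIZE     256    /* extended capabilities start at 0x100 */
  #define CFG_SPACE_EXP_SIZE 4096   /* end of extended config space */

  static int nreads;

  /* Removed device: every config read completes as all 1s. */
  static uint32_t cfg_read32(int pos)
  {
      (void)pos;
      nreads++;
      return 0xffffffff;
  }

  /* Toy extended capability search: the header has the capability ID
   * in bits 15:0 and the next pointer in bits 31:20. */
  static int find_ext_cap(uint16_t cap)
  {
      int pos = CFG_SPACE_SIZE;
      int ttl = (CFG_SPACE_EXP_SIZE - CFG_SPACE_SIZE) / 8;   /* 480 */
      uint32_t header = cfg_read32(pos);

      if (header == 0)
          return 0;

      while (ttl-- > 0) {
          if ((header & 0xffff) == cap)
              return pos;
          pos = (header >> 20) & 0xffc;    /* all 1s -> always 0xffc */
          if (pos < CFG_SPACE_SIZE)
              break;
          header = cfg_read32(pos);
      }
      return 0;
  }

  int main(void)
  {
      find_ext_cap(0x0001);    /* AER is extended capability ID 0x0001 */
      printf("%d config reads for one failed capability search\n", nreads);
      return 0;
  }

That prints 481, and caching the AER position (commit 66b80809) avoids
repeating the whole walk, which is why it eliminated the worst offender.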
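
And to put Keith's completion-timeout point in perspective: a posted
writel() returns immediately, but, per his observation above, a
non-posted access to a device whose link went down doesn't complete
until the Completion Timeout expires. A back-of-the-envelope sketch,
assuming each such access eats the full 10ms quoted above (the
32-vector count is just an example, not a measurement, and I'm not
claiming any current path does a flush read per vector; that's exactly
what I'm trying to pin down):

  #include <stdio.h>

  int main(void)
  {
      double timeout_ms = 10.0;   /* Completion Timeout in the 10ms range */
      int cap_walk_reads = 481;   /* one failed pci_find_ext_capability() */
      int vectors = 32;           /* hypothetical per-vector flush reads */

      printf("failed capability walk: %3d accesses -> %6.1f ms\n",
             cap_walk_reads, cap_walk_reads * timeout_ms);
      printf("per-vector flush reads: %3d accesses -> %6.1f ms\n",
             vectors, vectors * timeout_ms);
      return 0;
  }

At that rate a single failed capability walk is already ~4.8 seconds,
which lines up with the "many seconds" Keith described, so I'd still
like to pin down exactly which path issues the remaining non-posted
accesses rather than only short-circuiting them with the removed flag.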