On Tue, Dec 13, 2016 at 02:50:12PM -0600, Bjorn Helgaas wrote:
> On Mon, Dec 12, 2016 at 07:55:47PM -0500, Keith Busch wrote:
> > On Mon, Dec 12, 2016 at 05:42:27PM -0600, Bjorn Helgaas wrote:
> > > On Thu, Dec 08, 2016 at 02:32:53PM -0500, Keith Busch wrote:
> > > > Depending on the device and the driver, there are hundreds to
> > > > thousands of non-posted transactions submitted to the device to
> > > > complete driver unbinding and removal. Since the device is gone,
> > > > hardware has to handle that as an error condition, which is slower
> > > > than a successful non-posted transaction. Since we're doing 1000 of
> > > > them for no particular reason, it takes a long time. If you hot
> > > > remove a switch with multiple downstream devices, the serialized
> > > > removal adds up to many seconds.
> > >
> > > Another thread mentioned 1-2us as a reasonable config access cost, and
> > > I'm still a little puzzled about how we get to something on the order
> > > of a million times that cost.
> > >
> > > I know this is all pretty hand-wavey, but 1000 config accesses to shut
> > > down a device seems unreasonably high. The entire config space is
> > > only 4096 bytes, and most devices use a small fraction of that. If
> > > we're really doing 1000 accesses, it sounds like we're doing something
> > > wrong, like polling without a delay or something.
> >
> > Every time pci_find_ext_capability is called on a removed device, the
> > kernel will do 481 failed config space accesses trying to find that
> > capability. The kernel used to do that multiple times to find the AER
> > capability under conditions common to surprise removal.
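
To show where the 481 number comes from, here is roughly what the
search loop in pci_find_ext_capability() does (a simplified sketch, not
the kernel code verbatim). On a removed device every config read
completes with all 1's data, so PCI_EXT_CAP_NEXT() keeps returning
0xffc and the walk only ends when its iteration budget runs out: one
initial read plus (4096 - 256) / 8 = 480 more, 481 failed non-posted
reads per call.

#include <linux/pci.h>

static int find_ext_cap_sketch(struct pci_dev *dev, int cap)
{
	int ttl = (PCI_CFG_SPACE_EXP_SIZE - PCI_CFG_SPACE_SIZE) / 8; /* 480 */
	int pos = PCI_CFG_SPACE_SIZE;	/* extended space starts at 0x100 */
	u32 header;

	if (pci_read_config_dword(dev, pos, &header) != PCIBIOS_SUCCESSFUL)
		return 0;

	while (ttl-- > 0) {
		/* all 1's data gives cap ID 0xffff, so this never matches */
		if (PCI_EXT_CAP_ID(header) == cap)
			return pos;

		/* all 1's data makes this 0xffc, so the walk never terminates */
		pos = PCI_EXT_CAP_NEXT(header);
		if (pos < PCI_CFG_SPACE_SIZE)
			break;

		/* one more failed non-posted read per iteration */
		if (pci_read_config_dword(dev, pos, &header) != PCIBIOS_SUCCESSFUL)
			break;
	}

	return 0;
}
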
>
> Right, that's a perfect example. I'd rather fix issues like this by
> caching the position as you did with AER. The "removed" bit makes
> these issues "go away" without addressing the underlying problem.
>
> We might still need a "removed" bit for other reasons, but I want to
> be clear about those reasons, not just throw it in under the general
> "make it go fast" umbrella.
>
> > But now that we cache the AER position (commit 66b80809), we've
> > eliminated by far the worst offender. The counts I'm telling you still
> > reference the original captured traces showing long teardown times, so
> > they're not up to date with the most recent version of the kernel.
> >
> > > I measured the cost of config reads during enumeration using the TSC
> > > on a 2.8GHz CPU and found the following:
> > >
> > >   1580 cycles, 0.565 usec (device present)
> > >   1230 cycles, 0.440 usec (empty slot)
> > >   2130 cycles, 0.761 usec (unimplemented function of multi-function device)
> > >
> > > So 1-2usec does seem the right order of magnitude, and my "empty slot"
> > > error responses are actually *faster* than the "device present" ones,
> > > which is plausible to me because the Downstream Port can generate the
> > > error response immediately without sending a packet down the link.
> > > The "unimplemented function" responses take longer than the "empty
> > > slot" ones, which makes sense because the Downstream Port does have
> > > to send a packet to the device, which then complains because it
> > > doesn't implement that function.
> > >
> > > Of course, these aren't the same case as yours, where the link used to
> > > be up but is no longer. Is there some hardware timeout to see if the
> > > link will come back?
> >
> > Yes, the hardware does not respond immediately under this test, which
> > is considered an error condition. This is a reason why the PCIe Device
> > Capabilities 2 Completion Timeout Ranges are recommended to be in the
> > 10ms range.
>
> And we're apparently still doing a lot of these accesses? I'm still
> curious about exactly what these are, because it may be that we're
> doing more than necessary.

It's the MSI-X masking that's our next highest contributor. Masking
vectors still requires non-posted commands, and since those accesses
don't go through managed accessors the way config space accesses do,
the removed flag is needed so we can check before doing significant
MMIO. A sketch of what I mean follows.
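
Something like this (a sketch only; pci_dev_is_disconnected() stands in
for whatever form the removed-flag test ends up taking, and the call
site is illustrative, not the final patch):

#include <linux/io.h>
#include <linux/pci.h>

/* Per-entry mask bit in the MSI-X Vector Control word (drivers/pci/msi.h) */
#define PCI_MSIX_ENTRY_CTRL_MASKBIT	1

/* Stand-in for the removed-flag test this series adds (name assumed) */
static bool pci_dev_is_disconnected(struct pci_dev *dev)
{
	return dev->error_state == pci_channel_io_perm_failure;
}

static void msix_mask_vector_sketch(struct pci_dev *dev,
				    void __iomem *vector_ctrl)
{
	u32 ctrl;

	/*
	 * Device is gone: the read below would stall until the completion
	 * timeout fires and then return all 1's, so don't issue it at all.
	 */
	if (pci_dev_is_disconnected(dev))
		return;

	ctrl = readl(vector_ctrl);	/* the non-posted access */
	writel(ctrl | PCI_MSIX_ENTRY_CTRL_MASKBIT, vector_ctrl);
}

The point is that the check has to live at the call site, before the
readl(), because readl()/writel() have no managed wrapper underneath
them the way the config space accessors do.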