Re: [PATCH v5 RESEND] PCI: Set PCI-E Max Payload Size on fabric

On Fri, 2011-06-24 at 17:42 -0500, Jon Mason wrote:
> On Thu, Jun 23, 2011 at 11:49 PM, Benjamin Herrenschmidt
> <benh@xxxxxxxxxxxxxxxxxxx> wrote:
> > On Thu, 2011-06-23 at 22:54 -0500, Jon Mason wrote:
> >> There is a sizable performance boost for having the largest possible
> >> maximum payload size on each PCI-E device.  However, the maximum payload
> >> size must be uniform on a given PCI-E fabric, and each device, bridge,
> >> and root port can have a different max size.  To find and configure the
> >> optimal MPS settings, one must walk the fabric and determine the largest
> >> MPS available on all the devices on the given fabric.
> >
> > Ok, so a few comments:
> >
> >  - Don't do a pci_walk_bus_self() that takes a "self" bool, that's ugly.
> > If you want common code, then do a __pci_walk_bus() that takes a "self"
> > argument (call it include_self ?) and two small wrappers for
> > pci_walk_bus() and pci_walk_bus_self() (tho arguably the later isn't
> > needed, just use the __ variant directly)
> >
> >  - Thinking more, I don't see the point of that pci_walk_bus_self() ...
> > It will cause a bridge device to be called twice. If all you want is to
> > make sure the toplevel one is called, just do that manually before you
> > call pci_walk_bus_self()
> 
> If they are nested, then there could be many layers of PCIE switches
> that need to be configured with the MPS.  I'll look into this a bit
> more, but it is necessary.

No, my point is that without your change it should already be calling
the callback for intermediary bridges/switches. Or am I missing
something ? The only "missing" case is the top level from what I can
tell from reading the code.
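
For illustration only (the function name here is made up, not taken
from the patch), what I mean is something along these lines:

	/*
	 * pci_walk_bus() already visits every device below the bus,
	 * intermediate switches/bridges included; the only device it
	 * never passes to the callback is the bridge that owns the
	 * top-level bus itself, so handle that one by hand first.
	 */
	static void pcie_fabric_configure(struct pci_bus *bus,
					  int (*cb)(struct pci_dev *, void *),
					  void *data)
	{
		if (bus->self)
			cb(bus->self, data);

		pci_walk_bus(bus, cb, data);
	}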

> >  - Does pci_walk_bus() provide any guarantee in term of API as to the
> > order in which it walks the devices ? IE. Parent first, then children ?
> 
> I see no guarantee, but I thought it would make you happy to see the
> patch doing it similarly to the way the patch you sent out did
> things.
>
> > That's how it's implemented today but are we certain that will remain ?
> > It's not documented as such.... Your "forcemax" case as it is
> > implemented will work only as long as this ordering is respected.
> 
> That is an excellent point.  Would a comment above the function,
> warning that changing it will break things, be enough, or should I
> create a simpler function, similar to what I had before, that would
> guarantee it?

Either way, what does Jesse prefer ?

> >  - I would like a way to specify the MPS of the host bridge, it may
> > not be the max of the RC P2P bridge (it -generally is- but I'd like a
> > way to override that easily).
> 
> I could do another boot arg for this.

Please no... not boot args. Those suck. Besides, things can be different
per host bridge, especially on !x86 platforms. Best is pcibios_* helpers
(that take the bus as an argument) which the arch can populate for these
things.
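
Something like the below, purely as a sketch (the helper name and the
-1 convention are illustrative, not existing code):

	/* default: no platform limit, let the fabric maximum win */
	int __weak pcibios_max_mps(struct pci_bus *bus)
	{
		return -1;
	}

The arch can then provide its own version and, since it is handed the
bus, make the decision per host bridge rather than globally, without
any boot argument involved.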

> >  - I think we need to handle MRRS at the same time. We need MRRS to be
> > clamped by MPS, don't we ? In addition to being clamped top-down.
> 
> You are 100% correct.  I was working on a follow-on patch to do it,
> but I'll just put it into here.

Ok.

> >  - For MRRS we -must- have the arch provide the max of the host bridge
> 
> I'm not sure if this is a must, unless there is some arch requirement.

Well, the RC doesn't expose its capabilities. So you basically don't
know what your PHB MRRS limit is unless the arch code tells you. On
x86, you probably just rely on whatever the BIOS sets up, but a lot of
other archs out there need to be able to set up the PCIe hierarchy from
scratch. That's why my patch has this pcie_settings structure attached
to a host bridge, which contains the arch-provided MPS and MRRS for
that bridge.
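
Roughly along these lines (the field names are illustrative, not
copied verbatim from the patch):

	struct pcie_settings {
		int mps;	/* payload limit of the host bridge */
		int mrrs;	/* read request limit the RC can cope with */
	};

The core code can then clamp whatever it finds on the fabric against
the values the arch filled in for that host bridge.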

> IIRC, MRRS is more for bus fairness.  In theory, we would want it to
> be uniform across all of the system to prevent bus hogging by a device
> with a larger MRRS.

It's also about correctness. I've seen cases of devices not working
properly due to MRRS violations.

>   Of course, doing so will have a negative impact
> on performance.  Perhaps we should have 2 settings:
> 1) performance - set the MPS to the parent MPS and the MRRS to the device MPSS
> 2) safety - set the MPS uniformly to the smallest on the fabric and
> the MRRS to the same value
> Both of these could be overridden by an arch-specified value.
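
For illustration only (identifiers made up here, not from either
patch), the two modes above could boil down to a simple policy switch:

	enum pcie_bus_policy {
		PCIE_POLICY_PERFORMANCE, /* MPS = parent MPS, MRRS = device MPSS */
		PCIE_POLICY_SAFE,	 /* MPS = fabric minimum, MRRS = same   */
	};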
> 
> >  - pcie_mps_forcemax -> pcibios_use_max_mps(bus) or something like that.
> > I want it to be a function so I can make it per-platform or per host
> > bridge even if I have to. Might be a good spot to add an argument for
> > the bridge to return a platform maximum to clamp everything else as
> > well.
> >
> >  - Finally, should we make this "forcemax" the default behaviour or are
> > we afraid we'll break some nvidia gunk ? (I doubt it will break, that
> > would imply the two cards come up with small & different MPS in
> 
> I'd be fine with that, as long as I can have a warning message telling
> them to use the more secure way of doing it if they have issues.
> Sound reasonable?
> 
> > practice). Do we want to keep an API for drivers who want to clamp
> > things between two devices who want to do peer to peer ?
> 
> I think that is a larger issue than I was wanting to address in this
> patch.  I can do this as a follow-on patch if there is anyone out
> there who is interested in having a peer-to-peer framework.
> 
> Thanks so much for reviewing the patch!

Cheers,
Ben.

> Thanks,
> Jon
> 
> > Cheers,
> > Ben.
> >
> >

