On Sat, Jun 25, 2011 at 08:54:17AM +1000, Benjamin Herrenschmidt wrote: > On Fri, 2011-06-24 at 17:42 -0500, Jon Mason wrote: > > On Thu, Jun 23, 2011 at 11:49 PM, Benjamin Herrenschmidt > > <benh@xxxxxxxxxxxxxxxxxxx> wrote: > > > On Thu, 2011-06-23 at 22:54 -0500, Jon Mason wrote: > > >> There is a sizable performance boost for having the largest possible > > >> maximum payload size on each PCI-E device. However, the maximum payload > > >> size must be uniform on a given PCI-E fabric, and each device, bridge, > > >> and root port can have a different max size. To find and configure the > > >> optimal MPS settings, one must walk the fabric and determine the largest > > >> MPS available on all the devices on the given fabric. > > > > > > Ok, so a few comments: > > > > > > - Don't do a pci_walk_bus_self() that takes a "self" bool, that's ugly. > > > If you want common code, then do a __pci_walk_bus() that takes a "self" > > > argument (call it include_self ?) and two small wrappers for > > > pci_walk_bus() and pci_walk_bus_self() (tho arguably the later isn't > > > needed, just use the __ variant directly) > > > > > > - Thinking more, I don't see the point of that pci_walk_bus_self() ... > > > It will cause a bridge device to be called twice. If all you want is to > > > make sure the toplevel one is called, just do that manually before you > > > call pci_walk_bus_self() > > > > If they are nested, then there could be many layers of PCIE switches > > that need to be configured with the MPS. I'll look into this a bit > > more, but it is necessary. > > No, my point is that without your change it should already be calling > the callback for intermediary bridges/switches. Or am I missing > something ? The only "missing" case is the top level from what I can > tell from reading the code. > > > > - Does pci_walk_bus() provide any guarantee in term of API as to the > > > order in which it walks the devices ? IE. Parent first, then children ? > > > > I see no guarantee, but thought it would make you happy to see the > > patch doing it similar to the way the patch you sent out was doing > > things. > > > > > That's how it's implemented today but are we certain that will remain ? > > > It's not documented as such.... Your "forcemax" case as it is > > > implemented will work only as long as this ordering is respected. > > > > That is an excellent point. Would a comment above the function > > warning that changing it will break things, or should I create a > > simpler function, similar to what I had before, that would guarantee > > it? > > Either way, what does Jesse prefers ? I added a comment above the function to document it. So if it is changed, hopefully all callers would be audited and the comment would be seen. > > > - I would like a way to specify the MPS of the host bridge, it may > > > not be the max of the RC P2P bridge (it -generally is- but I'd like a > > > way to override that easily). > > > > I could do another boot arg for this. > > Please no... not boot args. Those suck. Besides things can be different > per host bridge, especially on !x86 platforms. Best is pcibios_* helpers > (that take the bus as an argument) which the arch can populate for these > things. Ok, no boot arg. It's possible to have any arch code specify the pcie_mpss prior to the buses being walk, which would work around this issue. This might be a bit hacky. If it ends up being too difficult, we can do something similar to what you suggest above. > > > > - I think we need to handle MRRS at the same time. We need MRRS to be > > > clamped by MPS, don't we ? In addition to being clamped top-down. > > > > You are 100% correct. I was working on a follow-on patch to do it, > > but I'll just put it into here. > > Ok. This is implemented in the patch I just sent out. Please look it over. Thanks, Jon > > > > - For MRRS we -must- have the arch provide the max of the host bridge > > > > I'm not sure if this is a must, unless there is some arch requirement. > > Well, RC doesn't expose it's capabilities. So you basically don't know > what your PHB MRSS limit is unless the arch code tells you. On x86, you > probably just rely on whatever the BIOS sets up, but a lot of other > archs out there need to be able to setup the PCIe hierarchy from > scratch. That's why my patch has this pcie_settings structure that is > attached to a host bridge which contains the arch provided MPS and MRSS > for that bridge. > > > IIRC, MRRS is more for bus fairness. In theory, we would want it to > > be uniform across all of the system to prevent bus hogging by a device > > with a larger MRRS. > > It's also about correctness. I've seen cases of devices not working > properly due to MRSS violations. > > > Of course, doing so will have a negative impact > > on performance. Perhaps we should have 2 settings: > > 1) performance - set the MPS to the parent MPS and the MRRS to the device MPSS > > 2) safety - set the MPS uniformly to the smallest on the fabric and > > the MRRS to the same value > > Both of these could be overridden by arch specified value. > > > > > - pcie_mps_forcemax -> pcibios_use_max_mps(bus) or something like that. > > > I want it to be a function so I can make it per-platform or per host > > > bridge even if I have to. Might be a good spot to add an argument for > > > the bridge to return a platform maximum to clamp everything else as > > > well. > > > > > > - Finally, should we make this "forcemax" the default behaviour or are > > > we afraid we'll break some nvidia gunk ? (I doubt it will break, that > > > would imply the two cards come up with small & different MPS in > > > > I'd be fine with that, as long as I can have a warning message telling > > them to use the more secure way of doing it if they have issues. > > Sound reasonable? > > > > > practice). Do we want to keep an API for drivers who want to clamp > > > things between two devices who want to do peer to peer ? > > > > I think that is a larger issue that I was wanting to address in this > > patch. I can do this as a follow on patch if there is anyone out > > there that is interested in having a peer-peer framework. > > > > Thanks so much for reviewing the patch! > > Cheers, > Ben. > > > Thanks, > > Jon > > > > > Cheers, > > > Ben. > > > > > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html