Re: [PATCH] PCI: Remove MRRS modification from MPS setting code

Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx> · Wed, 07 Sep 2011 07:58:28 -0300

On Wed, 2011-09-07 at 12:37 +0200, Rolf Eike Beer wrote:
> > On Wed, 2011-09-07 at 06:30 -0300, Benjamin Herrenschmidt wrote:
> >
> >> Unfortunately, I didn't manage to get a good TLP capture of the problem
> >> packets in AER. But basically what happens is:
> >>
> >>  - Host bridge has a large MPS (For example 4096)
> 
> Out of curiosity: what kind of board is this? The only thing we ever found
> for a reasonable price that has more than 128 bytes here is the Intel
> X58/Tylersburg (and an older NVidia one). This is not really my business
> anymore, but I guess I could make some people happy if I tell them what to
> look for.

IBM POWER7 server :-)

> >>  - Device has a smaller MPS (for example 128)
> >>  - Device has a large MRRS (for example 512)
> >
> > Just double-checked on the actual machine. The device has an MPS of 256
> > and the bridge can go up to 4096. We were leaving the bridge at 4096 and
> > observed the problem with an MRRS of 512 (apparently the power-on default
> > of that adapter).
> >
> > So either we clamp the bridge to 256 and penalize everybody, or we clamp
> > the e1000's MRRS to 256 and things work.
> 
> We need to change the MPS of the bridge anyway, as otherwise it could send,
> e.g., a 4k write request.

No, it couldn't in our case (nor in most implementations), because write
requests from the bridge are triggered by MMIO operations, which simply
cannot gather that much data. Granted, making that assumption requires
platform-specific knowledge (which is why I originally wanted that "fast"
mode to be selected specifically by the architecture).

In our case, we -know- our bridges will never generate a request longer
than 128 bytes. In fact, a larger one can probably only happen if you have
something like a fancy DMA controller on the CPU side of the bridge (though
Intel platforms tend to have that nowadays).

> Which is completely orthogonal to the MRRS
> and would cause the same breakage. And as far as I understand, what the
> patches do is exactly this change, for exactly this reason: to keep
> oversized packets from hitting the device. The MRRS only governs data that
> the target device originally requested, but it is by no means the only
> way such packets can arise. It may be the most likely way, but nothing
> more.
> 
> But it would still be interesting to get a list of devices that break
> when the MRRS is changed, and to kick the vendors hard to fix that mess.

Well, I would definitely recommend that, unless changed by the
architecture code, the default be the "safe" mode, in which the host
bridge is clamped (and effectively all devices in the hierarchy below the
host bridge are clamped to the lowest common MPS denominator).

On Power servers, I'm happy to switch by default to the "fast" approach,
which is to leave the upstream bridges at a higher MPS, since that gains
us a significant performance boost with some adapters, and to keep the
"safe" mode available as a kernel command-line option just in case.

Without that, anything like a hotplug enclosure, for example, would have
to have everything permanently clamped to 128 bytes, which sucks.

Cheers,
Ben.
