On Mon, Jul 30, 2018 at 08:19:50PM -0700, Alexander Duyck wrote:
> On Mon, Jul 30, 2018 at 7:33 PM, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > On Mon, Jul 30, 2018 at 08:02:48AM -0700, Alexander Duyck wrote:
> >> On Mon, Jul 30, 2018 at 7:07 AM, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> >> > On Sun, Jul 29, 2018 at 03:00:28PM -0700, Alexander Duyck wrote:
> >> >> On Sun, Jul 29, 2018 at 2:23 AM, Moshe Shemesh <moshes20.il@xxxxxxxxx> wrote:
> >> >> > On Sat, Jul 28, 2018 at 7:06 PM, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> >> >> >> On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote:
> >> >> >> > On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko <jiri@xxxxxxxxxxx> wrote:
> >> >> >> > > Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicinski@xxxxxxxxxxxxx wrote:
> >> >> >> > >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
> >> >> >> > >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
> >> >> >> > >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
> >> >> >> > >>> >>>> The devlink params haven't been upstream even for a full cycle and already you guys are starting to use them to configure standard features like queuing.
> >> >> >> > >>> >>>
> >> >> >> > >>> >>> We developed the devlink params in order to support non-standard configuration only. And for non-standard, there are generic and vendor-specific options.
> >> >> >> > >>> >>
> >> >> >> > >>> >> I thought it was developed for performing non-standard and possibly vendor-specific configuration. Look at DEVLINK_PARAM_GENERIC_* for examples of well-justified generic options for which we have no other API. The vendor mlx4 options look fairly vendor-specific if you ask me, too.
> >> >> >> > >>> >>
> >> >> >> > >>> >> Configuring queuing has an API. The question is whether it's acceptable to enter the risky territory of controlling offloads via devlink parameters, or whether we would rather make vendors take the time and effort to model things to (a subset of) existing APIs. The HW never fits the APIs perfectly.
> >> >> >> > >>> >
> >> >> >> > >>> > I understand what you meant here, but I would like to highlight that this mechanism was not meant to handle SRIOV, Representors, etc.
> >> >> >> > >>> > The vendor-specific configuration suggested here is to handle a congestion state in a Multi-Host environment (which includes a PF and multiple VFs per host), where one host is not aware of the other hosts, and each is running on its own pci/driver. It is a device working-mode configuration.
> >> >> >> > >>> >
> >> >> >> > >>> > This couldn't fit into any existing API, thus a unique vendor-specific API is needed.
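For concreteness, a driver-specific ("vendor") devlink param of the sort being discussed gets wired up roughly like the sketch below. Only the devlink structures and calls are the real API; the "foo" driver, the congestion_mode knob, and its semantics are invented, and the sketch is untested:

/*
 * Untested sketch of a driver-specific devlink param.  The "foo" driver,
 * the congestion_mode knob, and its semantics are invented; only the
 * devlink structures and calls are the real kernel API.
 */
#include <net/devlink.h>

struct foo_dev {                                /* stand-in driver private data */
        u8 congestion_mode;
};

enum foo_devlink_param_id {
        /* driver-specific IDs must live above the generic ones */
        FOO_DEVLINK_PARAM_ID_BASE = DEVLINK_PARAM_GENERIC_ID_MAX,
        FOO_DEVLINK_PARAM_ID_CONGESTION_MODE,
};

static int foo_congestion_mode_get(struct devlink *devlink, u32 id,
                                   struct devlink_param_gset_ctx *ctx)
{
        struct foo_dev *fdev = devlink_priv(devlink);

        ctx->val.vu8 = fdev->congestion_mode;
        return 0;
}

static int foo_congestion_mode_set(struct devlink *devlink, u32 id,
                                   struct devlink_param_gset_ctx *ctx)
{
        struct foo_dev *fdev = devlink_priv(devlink);

        fdev->congestion_mode = ctx->val.vu8;   /* a real driver would push this to HW */
        return 0;
}

static const struct devlink_param foo_devlink_params[] = {
        DEVLINK_PARAM_DRIVER(FOO_DEVLINK_PARAM_ID_CONGESTION_MODE,
                             "congestion_mode", DEVLINK_PARAM_TYPE_U8,
                             BIT(DEVLINK_PARAM_CMODE_RUNTIME),
                             foo_congestion_mode_get, foo_congestion_mode_set,
                             NULL),
};

/* called from probe(), after devlink_register() */
static int foo_register_params(struct devlink *devlink)
{
        return devlink_params_register(devlink, foo_devlink_params,
                                       ARRAY_SIZE(foo_devlink_params));
}

Once registered, the knob shows up as a device parameter in the devlink userspace tool, which is part of why it is so tempting to grow one-off options this way.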
> >> >> >> > >>> If we are just going to start creating devlink interfaces for every one-off option a device wants to add, why did we even bother with trying to prevent drivers from using sysfs? This just feels like we are back to the same arguments we had back in the day with it.
> >> >> >> > >>>
> >> >> >> > >>> I feel like the bigger question here is whether devlink is how we are going to deal with all PCIe-related features going forward, or whether we should start looking at creating a new interface/tool for PCI/PCIe-related features. My concern is that we have already had features such as DMA Coalescing that didn't really fit into anything, and now we are starting to see other things related to DMA and PCIe bus credits. I'm wondering if we shouldn't start looking at a tool/interface to configure all the PCIe-related features such as interrupts, error reporting, DMA configuration, power management, etc. Maybe we could even look at sharing it across subsystems and include things like storage, graphics, and other subsystems in the conversation.
> >> >> >> > >>
> >> >> >> > >>Agreed, for actual PCIe configuration (i.e. not ECN marking) we do need to build up an API. Sharing it across subsystems would be very cool!
> >> >> >>
> >> >> >> I read the thread (starting at [1], for anybody else coming in late) and I see this has something to do with "configuring outbound PCIe buffers", but I haven't seen the connection to PCIe protocol or features, i.e., I can't connect this to anything in the PCIe spec.
> >> >> >>
> >> >> >> Can somebody help me understand how the PCI core is relevant? If there's some connection with a feature defined by PCIe, or if it affects the PCIe transaction protocol somehow, I'm definitely interested in this. But if this only affects the data transferred over PCIe, i.e., the data payloads of PCIe TLP packets, then I'm not sure why the PCI core should care.
> >> >> >
> >> >> > As you wrote, this is not a PCIe feature and does not affect the PCIe transaction protocol.
> >> >> >
> >> >> > Actually, due to a hardware limitation in the current device, we have enabled a workaround in hardware.
> >> >> >
> >> >> > This mode is proprietary and not relevant to other PCIe devices, thus it is set using a driver-specific parameter in devlink.
> >> >>
> >> >> Essentially what this feature is doing is communicating the need for PCIe back-pressure to the network fabric. So as the buffers on the device start to fill because the device isn't able to get PCIe credits back fast enough, it will start to send congestion notifications to the network stack itself, if I understand this correctly.
> >> >
> >> > This sounds like a hook that allows the device to tell its driver about PCIe flow control credits, and the driver can pass that on to the network stack. IIUC, that would be a device-specific feature outside the scope of the PCI core.
> >> >
> >> >> For now there are no major conflicts, but when we start getting into stuff like PCIe DMA coalescing, and on a more general basis just PCIe active state power management, that is going to start making things more complicated going forward.
> >> >
> >> > We do support ASPM already in the PCI core, and we do have the pci_disable_link_state() interface, which is currently the only way drivers can influence it. There are several drivers that do their own ASPM configuration, but this is not safe because it's not coordinated with what the PCI core does. If/when drivers need more control, we should enhance the PCI core interfaces.
> >>
> >> This is kind of what I was getting at. It would be useful to have an interface of some sort so that drivers get notified when a user is making changes to configuration space, and I don't know if anything like that exists now.
> >
> > You mean something like this?
> >
> >   - driver registers a callback with PCI core
> >   - user runs setpci, which writes PCI config space via sysfs
> >   - kernel hook in pci_write_config() notices write and calls driver callback
> >   - driver callback receives config address and data written
> >   - driver parses PCI capability lists to identify register
> >
> > Nothing like that exists today, and this is not what I had in mind by "enhance the PCI core interfaces". I'm not sure what the utility of this is (but I'm not a networking guy by any means).
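To make the shape of that concrete, such a hook might look something like the sketch below. To be clear, nothing like this exists in the PCI core today; the event struct, the notifier chain, and the callback are invented names, and only the notifier-chain primitives are real kernel API:

/*
 * Purely hypothetical sketch of the scheme above -- nothing like this
 * exists in the PCI core today.  The event struct, the chain, and the
 * callback are invented names; only the notifier-chain primitives are
 * real kernel API.
 */
#include <linux/notifier.h>
#include <linux/pci.h>

struct pci_cfg_write_event {                    /* invented */
        struct pci_dev *pdev;
        unsigned int offset;                    /* config space offset written */
        unsigned int size;                      /* 1, 2 or 4 bytes */
        u32 value;
};

/* The PCI core would publish sysfs (setpci) config writes on this chain. */
static BLOCKING_NOTIFIER_HEAD(pci_cfg_write_chain);

/* A driver callback: notice when its config space changes behind its back. */
static int foo_cfg_write_notify(struct notifier_block *nb,
                                unsigned long action, void *data)
{
        struct pci_cfg_write_event *ev = data;

        dev_info(&ev->pdev->dev,
                 "user wrote %#x (%u bytes) to config offset %#x\n",
                 ev->value, ev->size, ev->offset);
        /* ...walk the capability lists, refresh cached registers, etc. */
        return NOTIFY_OK;
}

static struct notifier_block foo_cfg_write_nb = {
        .notifier_call = foo_cfg_write_notify,
};

/*
 * The driver's probe() would do:
 *         blocking_notifier_chain_register(&pci_cfg_write_chain, &foo_cfg_write_nb);
 * and the sysfs config-write path would do:
 *         blocking_notifier_call_chain(&pci_cfg_write_chain, 0, &ev);
 */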
> Well, in general I have been wondering if setpci is really the cleanest way to do any of this. I have found it can be a fast way to really mess things up. For example, using setpci to trigger an FLR is a fast way to cripple an interface. I recall using that approach when testing the fm10k driver to deal with surprise resets.
>
> The problem is setpci is REALLY powerful. It lets you change a ton that can impact the driver, and many drivers just read the config space in their probe function and assume it is static (which in most cases it is). That is why I was thinking, as a small step, it might be useful to have the notifications delivered to the driver, if it is registered on the interface, so it could prepare for or clean up after a change to the PCI configuration space.

Yep, setpci is very powerful. I doubt it's worth having drivers try to react, just because setpci can do completely arbitrary things, many of which are not recoverable even in principle. But I do think we should make it taint the kernel so we have a clue when things go wrong.
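Something as small as the following in the sysfs config-write path would probably be enough for that. This is only a sketch of the idea, not existing code; the helper name is made up, while add_taint(), TAINT_USER, and pci_warn() are the real interfaces:

/*
 * Sketch only, not existing kernel code.  The helper name is made up;
 * add_taint()/TAINT_USER and pci_warn() are the real interfaces.
 */
#include <linux/kernel.h>
#include <linux/pci.h>

static void pci_cfg_write_taint(struct pci_dev *pdev, loff_t off, size_t count)
{
        pci_warn(pdev, "config space written from userspace (offset %#llx, %zu bytes), tainting kernel\n",
                 (unsigned long long)off, count);
        add_taint(TAINT_USER, LOCKDEP_STILL_OK);
}

That way any later oops report at least carries a taint hint that config space was written from userspace.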
> > I think it's a bad idea for drivers to directly write config space. It would be much better to provide a PCI core interface so we can implement things once and coordinate the things that need to be coordinated.
>
> I agree with that.
>
> >> > I don't know what PCIe DMA coalescing means, so I can't comment on that.
> >>
> >> There are devices, specifically network devices, that will hold off on switching from L0s or L1 back to L0 by deferring DMA operations. Basically the idea is supposed to be to hold off bringing the link up for as long as possible in order to maximize the power savings from the ASPM state. This is something that has come up in the past, and I don't know if any interface has been determined for how to handle this sort of configuration. Most of it occurs through MMIO.
> >
> > The device can certainly delay L0s or L1 exit if it wants to. If there are knobs to control this, e.g., how long it can defer a DMA, it makes sense that they would be device-specific and in MMIO space. The PCI core can't be involved in that because in general it knows nothing about the contents of MMIO BARs. Presumably those knobs would work within the framework of ASPM as defined by the PCIe spec, e.g., if we disable ASPM, the knobs would do nothing because there is no L0s or L1 at all.
> >
> > That's not to say that device designers couldn't get together and define a common model for such knobs that puts them at well-known offsets in well-known BARs. All I'm saying is that this sounds like it's currently outside the realm of the PCI specs and the PCI core.
>
> Some of this was already handled in LTR and OBFF. I think the DMA coalescing was meant to work either with those or optionally on its own. As far as being ASPM-dependent or not, I don't think it really cared all that much, other than knowing whether ASPM was enabled and how long it takes for a given device to come out of either of those states. The general idea with DMA coalescing was to try to save power on the CPU, and then possibly on the PCIe link if the value was set high enough and ASPM was enabled.
>
> Anyway, we have gotten off onto a bit of a tangent, as I was just citing something that might end up interacting with a feature such as notifying the stack that the Rx buffer on a given device has become congested.
>
> >> >> I assume the devices we are talking about supporting this new feature on either don't deal with ASPM or assume a quick turnaround to get out of the lower power states? Otherwise that would definitely cause some back-pressure buildups that would hurt performance.
> >> >
> >> > Devices can communicate the ASPM exit latency they can tolerate via the Device Capabilities register (PCIe r4.0, sec 7.5.3.3). Linux should be configuring ASPM to respect those device requirements.
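For reference, those acceptable-latency fields are visible to a driver through the standard accessors. A minimal sketch, in which the helper name is made up and decoding the 3-bit latency encodings into actual times is omitted:

/*
 * Minimal sketch: read the L0s/L1 acceptable exit latencies a device
 * advertises in its Device Capabilities register.  The helper name is
 * made up; the accessor and register macros are the standard ones, and
 * decoding the 3-bit encodings into actual times is omitted.
 */
#include <linux/pci.h>

static void foo_show_aspm_latencies(struct pci_dev *pdev)
{
        u32 devcap, l0s, l1;

        pcie_capability_read_dword(pdev, PCI_EXP_DEVCAP, &devcap);

        l0s = (devcap & PCI_EXP_DEVCAP_L0S) >> 6;       /* bits 8:6 */
        l1  = (devcap & PCI_EXP_DEVCAP_L1) >> 9;        /* bits 11:9 */

        pci_info(pdev, "acceptable exit latency codes: L0s %u, L1 %u\n", l0s, l1);
}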
> >> Right. But my point was that something like ASPM will add extra complexity to a feature such as the one described here. My concern is that I don't want us implementing stuff on a per-driver basis that is not all that unique to the device. I don't really see the feature described above as something that will stay specific to this one device for very long, especially if it provides added value. Basically all it is doing is exposing PCIe congestion management to upper levels in the network stack. I don't even necessarily see it as being networking-specific, as I would imagine there might be other types of devices that could make use of knowing how many transactions and such they can process at the same time.
> >
> > It sounds to me like you need a library that can be used by all the drivers that need this functionality. Unless it's something documented in the PCIe specs so we can rely on a standard way of discovering and configuring things, I don't see how the PCI core can really be involved.
> >
> > Bjorn
>
> I am kind of thinking that a common library would be the preferred way to go, or at least a common interface to share between the drivers. It wasn't my intention to imply that the PCI core needed to get involved. I was including the linux-pci list more as a way of getting the broader audience's input, since we were just discussing this on netdev.
>
> My main concern is that those of us on netdev have historically been dealing with features such as SR-IOV as a networking thing, when really it has been a combination of a PCI feature and some network switching. I might have ratholed this a bit, as I kind of see this topic as something similar.

OK, thanks.  Sounds like we have similar perspectives and I appreciate being looped in!

Bjorn