Re: [net-next v4 00/15] Add mlx5 subfunction support

Alexander Duyck <alexander.duyck@xxxxxxxxx> · Thu, 17 Dec 2020 10:48:48 -0800

On Wed, Dec 16, 2020 at 4:38 PM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>
> On Wed, Dec 16, 2020 at 02:53:07PM -0800, Alexander Duyck wrote:
>
> > It isn't about the association, it is about who is handling the
> > traffic. Going back to the macvlan model what we did is we had a group
> > of rings on the device that would automatically forward unicast
> > packets to the macvlan interface and would be reserved for
> > transmitting packets from the macvlan interface. We took care of
> > multicast and broadcast replication in software.
>
> Okay, maybe I'm starting to see where you are coming from.
>
> First, I think some clarity here, as I see it the devlink
> infrastructure is all about creating the auxdevice for a switchdev
> port.
>
> What goes into that auxdevice is *completely* up to the driver. mlx5
> is doing a SF which == VF, but that is not a requirement of the design
> at all.
>
> If an Intel driver wants to put a queue block into the aux device and
> that is != VF, it is just fine.
>
> The Intel netdev that binds to the auxdevice can transform the queue
> block and specific switchdev config into a netdev identical to
> accelerated macvlan. Nothing about this breaks the switchdev model.

Just to clarify I am not with Intel, nor do I plan to work on any
Intel drivers related to this.

My concern has more to do with how this is being plumbed and the fact
that the basic architecture is somewhat limiting.

> Essentially think of it as generalizing the acceleration plugin for a
> netdev. Instead of making something specific to limited macvlan, the
> driver gets to provide exactly the structure that matches its HW to
> provide the netdev as the user side of the switchdev port. I see no
> limitation here so long as the switchdev model for controlling traffic
> is followed.

I see plenty. The problem is it just sets up more vendor lock-in and
features that have to be thrown away when you have to settle for the
lowest common denominator in order to maintain functionality across
vendors.

> Let me segue into a short story from RDMA.. We've had a netdev called
> IPoIB for a long time. It is actually kind of similar to this general
> thing you are talking about, in that there is a programming layer
> under the IPoIB netdev called RDMA verbs that generalizes the actual
> HW. Over the years this became more complicated because every new
> netdev offload needed mirroring into the RDMA verbs general
> API. TSO, GSO, checksum offload, endlessly onwards. It became quite
> dumb in the end. We gave up and said the HW driver should directly
> implement netdev. Implementing a middle API layer makes zero sense
> when netdev is already perfectly suited to implement on top of
> HW. Removing SW layers caused performance to go up something like
> 2x.
>
> The hard-earned lesson I take from that is don't put software layers
> between a struct net_device and the actual HW. The closest coupling is
> really the best thing. Provide library code in the kernel to help
> drivers implement common patterns when making their netdevs, do not
> provide wrapper netdevs around drivers.
>
> IMHO the approach of macvlan acceleration made some sense in 2013, but
> today I would say it is mashing unrelated layers together and
> polluting what should be a pure SW implementation with HW hooks.

I disagree here. In my mind a design where two interfaces, which both
exist in the kernel, have to go to hardware in order to communicate is
very limiting. The main thing I want to see is the option of
being able to pass traffic directly between the switchdev and the SF
without the need to touch the hardware.

An easy example of such traffic that would likely benefit from this is
multicast/broadcast traffic. Instead of having to process each and
every broadcast packet in hardware you could very easily process it at
the switchdev and then directly hand it off from the switchdev to the
SF in this case instead of having to send it to hardware for each
switchdev instance.
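
To make that concrete, here is a rough userspace sketch of the idea, not
kernel code and not anything from this series. LocalPort, ToySwitch, and
hw_tx are hypothetical names; the assumption is simply that the switchdev
can see every SF that lives in the same kernel. A broadcast frame is
replicated to the local ports in software and costs one trip to the
hardware uplink, rather than one per SF:

```python
from dataclasses import dataclass, field

BROADCAST = "ff:ff:ff:ff:ff:ff"


@dataclass
class LocalPort:
    """Hypothetical stand-in for an SF netdev in the same kernel."""
    mac: str
    rx: list = field(default_factory=list)  # frames received by this port


@dataclass
class ToySwitch:
    """Toy model of a switchdev that can deliver to local ports in software."""
    ports: dict = field(default_factory=dict)   # MAC address -> LocalPort
    hw_tx: list = field(default_factory=list)   # frames that went to the NIC

    def add_port(self, port):
        self.ports[port.mac] = port

    def forward(self, src, dst, payload):
        frame = (src, dst, payload)
        if dst == BROADCAST:
            # Software replication: every local port except the sender
            # gets a copy without a hardware round trip.
            for mac, port in self.ports.items():
                if mac != src:
                    port.rx.append(frame)
            # Assumption: a single copy still leaves through the uplink.
            self.hw_tx.append(frame)
        elif dst in self.ports:
            # East-west unicast: handed over directly, hardware untouched.
            self.ports[dst].rx.append(frame)
        else:
            # Unknown destination: only now does the frame go to hardware.
            self.hw_tx.append(frame)
```

With three local ports, one broadcast from the first port lands in the
other two receive lists and adds exactly one entry to hw_tx; a unicast
between two local ports adds none.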

> I see from the mailing list comments this was done because creating a
> device specific netdev via 'ip link add' was rightly rejected. However
> here we *can* create a device specific vmdq *auxdevice*.  This is OK
> because the netdev is controlling and containing the aux device via
> switchdev.
>
> So, Intel can get the "VMDQ link type" that was originally desired more
> or less directly, so long as the associated switchdev port controls
> the MAC filter process, not "ip link add".
>
> And if you want to make the vmdq auxdevice into an ADI by user DMA to
> queues, then sure, that model is completely sane too (vs hacking up
> macvlan to expose user queues) - so long as the kernel controls the
> selection of traffic into those queues and follows the switchdev
> model. I would recommend creating a simple RDMA raw ethernet queue
> driver over the aux device for something like this :)

You lost me here, I'm not seeing how RDMA and macvlan are connected.

> > That might be a bad example, I was thinking of the issues we have had
> > with VFs and direct assignment to Qemu based guests in the past.
>
> As described, this is solved by VDPA.
>
> > Essentially what I am getting at is that the setup in the container
> > should be vendor agnostic. The interface exposed shouldn't be specific
> > to any one vendor. So if I want to fire up a container on Mellanox,
> > Broadcom, or some other vendor it shouldn't matter or be visible to
> > the user. They should just see a vendor agnostic subfunction
> > netdevice.
>
> Agree. The agnostic container user interface here is 'struct
> net_device'.

I disagree here. The fact is that a Mellanox netdev, a Broadcom netdev,
and an Intel netdev all have a very different look and feel, as the
netdev is essentially just the base device you are building around.

In addition it still doesn't address my concern as called out above
which is the east-west traffic problem.

> > > I have the feeling this stuff you are asking for is already done..
> >
> > The case you are describing has essentially solved it for Qemu
> > virtualization and direct assignment. It still doesn't necessarily
> > solve it for the container case though.
>
> The container case doesn't need solving.

I disagree, and that is at the heart of where you and I have different
views. I see two advantages to having the container case solved:
1. A standardized set of features that can be provided regardless of vendor
2. Allowing for the case where east-west traffic can avoid having to
touch hardware

> Any scheme I've heard for container live migration, like CRIU,
> essentially hot plugs the entire kernel in/out of a user process. We
> rely on the kernel providing low leakage of the implementation details
> of the struct net_device as part of its uAPI contract. When CRIU
> swaps the kernel the new kernel can have any implementation of the
> container netdev it wants.

I'm not thinking about migration. I am thinking more about the user
experience. In my mind if I set up a container I shouldn't need to
know which vendor provided the network interface when I set it up. The
problem is most NICs have so many one-off proprietary tweaks needed
that it gets annoying. That is why in my mind it would make much more
sense to have a simple vendor agnostic interface. That is why I would
prefer to avoid the VF model.

> I've never heard of a use case to hot swap the implementation *under* a
> netdev from a container. macvlan can't do this today. If you have a
> use case here, it really has nothing to do with this series.

Again, the hot-swap isn't necessarily what I am talking about. I am
talking about setting up a config for a set of containers in a
datacenter. What I don't want is to have to maintain one set of configs
for an mlx5 SF, another for a Broadcom SF, and yet another set for any
other vendors out there. I would much rather have all of that dealt
with within the namespace that is handling the switchdev setup.
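
As a rough illustration of what I mean by a single config, here is a
small sketch. Everything in it is hypothetical: PortSpec, the two backend
classes, and provision_all are made-up names, and the "steps" are
placeholders rather than real driver or devlink operations. The point is
only that the container side is described once, and whatever is
vendor-specific stays behind the backend on the switchdev side:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortSpec:
    """Vendor-neutral description of the netdev a container should get."""
    name: str
    mac: str
    mtu: int = 1500


class VendorABackend:
    """Hypothetical backend; vendor-specific setup stays private to it."""

    def __init__(self):
        self.steps = []  # placeholder log of vendor-private work

    def provision(self, spec):
        self.steps.append(f"vendor-a: set up queues for {spec.name}")
        return {"ifname": spec.name, "mac": spec.mac, "mtu": spec.mtu}


class VendorBBackend:
    """A second hypothetical backend with different private steps."""

    def __init__(self):
        self.steps = []

    def provision(self, spec):
        self.steps.append(f"vendor-b: allocate function for {spec.name}")
        self.steps.append(f"vendor-b: program filter for {spec.mac}")
        return {"ifname": spec.name, "mac": spec.mac, "mtu": spec.mtu}


def provision_all(specs, backend):
    """Apply the same vendor-neutral specs through whichever backend is present."""
    return [backend.provision(spec) for spec in specs]
```

The same list of PortSpec entries yields the same container-visible
result from either backend, even though the private steps differ.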

In addition, the east-west traffic is the other bit I would like to
see addressed. I am okay excusing this in the case of direct
assignment since the resources for the SF will not be available to the
host. However if the SF will be operating in the same kernel as the
PF/switchdev it would make much more sense to enable an east/west
channel which would allow for hardware bypass under certain
circumstances without having to ever leave the kernel.