On Wed, Nov 25, 2020 at 10:54:22AM -0800, Jakub Kicinski wrote:

> > RDMA covers a wide range of accelerated networking these days..
> > Where else are you going to put this stuff in the kernel?
>
> IDK what else you got in there :) It's probably a case by case answer.

Hmm, yes, it seems endless sometimes :(

> IMHO even using libibverbs is no strong reason for things to fall
> under RDMA exclusively. Client drivers of virtio don't get silently
> funneled through a separate tree just because they use a certain spec.

I'm not sure I understand this. libibverbs is the user library to
interface with the kernel RDMA subsystem. I don't care what apps people
build on top of it, and it doesn't matter to me that netdev and DPDK
have some kind of feud.

> > > I'm sure if you start doing crypto over ibverbs crypto people
> > > will want to have a look.
> >
> > Well, RDMA has crypto transforms for a few years now too.
>
> Are you talking about RDMA traffic being encrypted? That's a
> different case.

That too, but in general, anything netdev can do can be done via RDMA
in userspace. So all the kTLS and IPSEC xfrm HW offloads mlx5 supports
are available in userspace too.

> > Part of the point of the subsystem split was to end the fighting
> > that started all of it. It was very clear during the whole iWarp
> > and TCP Offload Engine business in the mid 2000s that netdev wanted
> > nothing to do with the accelerator world.
>
> I was in middle school at the time, not sure what exactly went down :)

Ah, it was quite the thing. Microsoft and Co were heavily pushing TOE
technology (Microsoft Chimney!) as the next sure thing, and I recall
DaveM & co were completely against it in Linux. I will admit at the
time I was doubtful, but in hindsight this was the correct choice.
netdev would not look like it does today if it had been shackled by
the HW implementations of the day.

Instead all this HW stuff ended up largely in RDMA, and some in block
with the iSCSI mania of old. It is quite evident to me what a mess
being tied to HW causes for a SW ecosystem; DRM and RDMA both suffer
in a very similar way because of it.

However, over the last 20 years it has held true that there is
*always* a compelling reason for certain applications to use something
from the accelerator side. It is not for everyone, but the specialized
applications that need it *really need it*. For instance, it can be
the difference between getting a COVID simulation result in a few
weeks vs.. well.. never.

> But I'm going by common sense here. Perhaps there was an agreement
> I'm not aware of?

The resolution to the argument above was to split them in Linux. Thus
what is logically networking was split up in the kernel between netdev
and the accelerator subsystems (iSCSI, RDMA, and so on).

The general notion is that netdev doesn't have to accommodate anything
an accelerator does. If you choose to run accelerators, you do not get
to complain that your ethtool counters are wrong, that your routing
tables and tc don't work, that firewalling doesn't work, etc. That is
all broken by design. In turn, the accelerators do their own thing,
tap the traffic before it hits netdev, and so on. netdev does not care
what goes on over there and is not responsible.

I would say this is the basic unspoken agreement of the last 15 years.
Both have a right to exist in Linux. Both have a right to use the
physical ethernet port.

> > So why would netdev need sign off on any accelerator stuff?
>
> I'm not sure why you keep saying accelerators!
>
> What is accelerated in raw Ethernet frame access??

The nature of the traffic is not relevant. If it goes through RDMA, it
is accelerator traffic (vs netdev traffic, which goes to netdev).

Even if you want to be pedantic, in the raw ethernet area there is
lots of HW special accelerated stuff going on. Mellanox has some
really neat hard realtime networking technology that works on raw
ethernet packets, for instance. And of course raw ethernet is a
fraction of what RDMA covers; iWarp and RoCE are much more like what
you might imagine when you hear the word accelerator.
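To make "raw Ethernet frame access" concrete, here is roughly what a
userspace app does with libibverbs to pull raw frames out before
netdev ever sees them. This is only a sketch: error handling, the
ibv_modify_qp() state transitions, and recv buffer posting are all
omitted, and the match values are illustrative.

	/* Raw packet QPs require CAP_NET_RAW */
	#include <infiniband/verbs.h>

	static struct ibv_flow *tap_raw_ethernet(void)
	{
		struct ibv_device **devs = ibv_get_device_list(NULL);
		struct ibv_context *ctx = ibv_open_device(devs[0]);
		struct ibv_pd *pd = ibv_alloc_pd(ctx);
		struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0);

		struct ibv_qp_init_attr qpa = {
			.send_cq = cq,
			.recv_cq = cq,
			.cap = { .max_recv_wr = 64, .max_recv_sge = 1 },
			.qp_type = IBV_QPT_RAW_PACKET, /* raw ethernet */
		};
		struct ibv_qp *qp = ibv_create_qp(pd, &qpa);

		/* Frames matching this steering rule land in the QP,
		 * not in netdev */
		struct {
			struct ibv_flow_attr attr;
			struct ibv_flow_spec_eth eth;
		} flow = {
			.attr = {
				.type = IBV_FLOW_ATTR_NORMAL,
				.size = sizeof(flow),
				.num_of_specs = 1,
				.port = 1,
			},
			.eth = {
				.type = IBV_FLOW_SPEC_ETH,
				.size = sizeof(struct ibv_flow_spec_eth),
				/* .val/.mask pick dst MAC, vlan,
				 * ethertype, ... */
			},
		};
		return ibv_create_flow(qp, &flow.attr);
	}

No netdev API is involved anywhere in that path, which is the whole
point of the bypass.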
> > Do you want to start co-operating now? I'm willing to talk about
> > how to do that.
>
> IDK how that's even in question. I always try to bump all
> RDMA-looking stuff to linux-rdma when it's not CCed there. That's the
> bare minimum of cooperation I'd expect from anyone.

I mean co-operate in the sense of defining a scheme where the two
worlds are not completely separated and isolated.

> > > And our policy on DPDK is pretty widely known.
> >
> > I honestly have no idea on the netdev DPDK policy,
> > I'm maintaining the RDMA subsystem not DPDK :)
>
> That's what I thought, but turns out DPDK is your important user.

Nonsense. I don't have stats, but the majority of people I work with
using RDMA are not using DPDK. DPDK serves two somewhat niche markets,
NFV and certain hyperscalers, while RDMA covers the entire scientific
computing community and a big swath of the classic "Big Iron"
enterprise stuff, like databases and storage.

> Now IIUC you're tapping traffic for DPDK/raw QPs _before_ all
> switching happens in the NIC? That breaks the switchdev model. We're
> back to per-vendor magic.

No, as I explained before, the switchdev completely contains the
SF/VF, and all applications running on a mlx5_core are trapped by it.
This includes netdev, RDMA and VDPA.

> And why do you need a separate VDPA table in the first place?
> Forwarding to a VDPA device has different semantics than forwarding
> to any other VF/SF?

The VDPA table is not switchdev. Go back to my overly long email about
VDPA; here we are talking about the "selector" that chooses which
subsystem the traffic will go to. The selector sits after switchdev
but before netdev, VDPA, and RDMA. Each accelerator subsystem gets a
table: RDMA, VDPA, and netdev all get one. It is some part of the HW
that makes the selection work.
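If it helps to picture it, the pipeline looks conceptually like this.
The names here are invented for illustration; they are not the actual
mlx5 structures.

	struct steering_table;	/* a HW match/action table */

	struct function_steering {
		/* switchdev: traps everything entering or leaving
		 * the SF/VF */
		struct steering_table *eswitch;

		/* picks which subsystem consumes the packet */
		struct steering_table *selector;

		/* one table per subsystem, programmed by that
		 * subsystem */
		struct steering_table *netdev;
		struct steering_table *rdma;	/* e.g. raw packet QP flows */
		struct steering_table *vdpa;
	};

	/*
	 * A frame from the physical port walks:
	 *
	 *   eswitch (switchdev) -> selector -> one subsystem table
	 *
	 * This is why netdev's counters, tc and firewalling never see
	 * RDMA or VDPA traffic: it was selected away before netdev
	 * appeared in the path.
	 */

Jason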