Re: [PATCH net-next 00/13] Add mlx5 subfunction support

On Thu, Nov 19, 2020 at 07:35:26PM -0800, Jakub Kicinski wrote:
> On Thu, 19 Nov 2020 10:00:17 -0400 Jason Gunthorpe wrote:
> > Finally, in the mlx5 model VDPA is just an "application". It asks the
> > device to create a 'RDMA' raw ethernet packet QP that uses rings
> > formed in the virtio-net specification. We can create it in the kernel
> > using mlx5_vdpa, and we can create it in userspace through the RDMA
> > subsystem. Like any "RDMA" application it is contained by the security
> > boundary of the PF/VF/SF the mlx5_core is running on.
> 
> Thanks for the write up!

No problem!

> The part that's blurry to me is VDPA.

Okay, I think I see where the gap is. I'm going to elaborate below so
we are clear.

> I was under the impression that for VDPA the device is supposed to
> support native virtio 2.0 (or whatever the "HW friendly" spec was).

I think VDPA covers a wide range of things.

The basic idea is starting with the all SW virtio-net implementation
we can move parts to HW. Each implementation will probably be a little
different here. The kernel vdpa subsystem is a toolbox to mix the
required emulation and HW capability to build a virtio-net PCI
interface.
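
Concretely, that toolbox is struct vdpa_config_ops from
include/linux/vdpa.h: a vendor driver like mlx5_vdpa fills in the ops
with whatever mix of HW offload and SW emulation it has. A heavily
abridged sketch (field list trimmed from memory, the header is
authoritative):

  /* Abridged sketch, not the full structure - see struct
   * vdpa_config_ops in include/linux/vdpa.h. Kernel u8/u16/u32/u64
   * types assumed. */
  struct vdpa_device;

  struct vdpa_config_ops_sketch {
          /* tell the backend where the virtqueue rings live */
          int  (*set_vq_address)(struct vdpa_device *vdev, u16 idx,
                                 u64 desc_area, u64 driver_area,
                                 u64 device_area);
          void (*set_vq_num)(struct vdpa_device *vdev, u16 idx, u32 num);
          /* doorbell: the virtio driver kicked this queue */
          void (*kick_vq)(struct vdpa_device *vdev, u16 idx);
          /* virtio feature and status negotiation */
          u64  (*get_features)(struct vdpa_device *vdev);
          void (*set_status)(struct vdpa_device *vdev, u8 status);
          /* ... plus config space access, vq state save/restore, etc. */
  };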

The key question to ask of any VDPA design is "what does the VDPA
FW do with the packet once the HW accelerator has parsed the
virtio-net descriptor?".

The VDPA world has refused to agree on this due to vendor squabbling,
but mlx5 has a clear answer:

 VDPA Tx generates an ethernet packet and sends it out the SF/VF port
 through a tunnel to the representor and then on to the switchdev.

Other VDPA designs have a different answer!!

This concept is so innate to how Mellanox views the world that it does
not surprise me that the cover letters and patch descriptions don't
belabor this point much :)

I'm going to deep dive through this answer below. I think you'll see
this is the most sane and coherent architecture with the tools
available in netdev. Mellanox thinks the VDPA world should
standardize on this design so we can have a standard control plane.

> You're saying it's a client application like any other - do I understand
> it right that the hypervisor driver will be translating descriptors
> between virtio and device-native then?

No, the hypervisor creates a QP and tells the HW that this QP's
descriptor format follows virtio-net. The QP processes those
descriptors in HW and generates ethernet packets.
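
For reference, "follows virtio-net" means the HW consumes the standard
split-ring descriptors and virtio-net header defined by the virtio
spec (see include/uapi/linux/virtio_ring.h and virtio_net.h). Roughly,
abridged - the mergeable-buffer variant adds a num_buffers field:

  #include <stdint.h>

  /* One split-ring descriptor: points at a guest buffer, optionally
   * chained to the next descriptor. */
  struct vring_desc {
          uint64_t addr;    /* guest-physical address of the buffer */
          uint32_t len;     /* length of the buffer */
          uint16_t flags;   /* NEXT / WRITE / INDIRECT */
          uint16_t next;    /* index of the chained descriptor */
  };

  /* Header prepended to every virtio-net packet buffer. */
  struct virtio_net_hdr {
          uint8_t  flags;
          uint8_t  gso_type;
          uint16_t hdr_len;
          uint16_t gso_size;
          uint16_t csum_start;
          uint16_t csum_offset;
  };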

A "client application like any other" means that the ethernet packets
VDPA forms are identical to the ones netdev or RDMA forms. They are
all delivered into the tunnel on the SF/VF to the representor and on
to the switch. See below.

> The vdpa parent is in the hypervisor correct?
> 
> Can a VDPA device have multiple children of the same type?

I'm not sure parent/child are good words here.

The VDPA emulation runs in the hypervisor, and the virtio-net netdev
driver runs in the guest. The VDPA is attached to a switchdev port and
representor tunnel by virtue of its QPs being created under a SF/VF.

If we imagine a virtio-rdma, then you might have a SF/VF hosting both
VDPA and VDPA-RDMA which emulate two PCI devices assigned to a
VM. Both of these peer virtios would generate ethernet packets for TX
on the SF/VF port into the tunnel, through the representor, and to the
switch.

> Why do we have a representor for a SF, if the interface is actually VDPA?
> Block and net traffic can't reasonably be treated the same by the
> switch.

I think you are focusing on queues; the architecture at PF/SF/VF is
not queue based, it is packet based.

At the physical mlx5 the netdev has a switchdev. On that switch I can
create a *switch port*.

The switch port is composed of a representor and a SF/VF. They form a
tunnel for packets.

The representor is the hypervisor side of the tunnel and contains all
packets coming out of and into the SF/VF.

The SF/VF is the guest side of the tunnel and has a full NIC.

The SF/VF can be:
 - Used in the same OS as the switch
 - Assigned to a guest VM as a PCI device
 - Assigned to another processor in the SmartNIC case.
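
In devlink terms each of these surfaces as a switch port with a
flavour, and the series under discussion adds an SF flavour alongside
the existing PF/VF ones. A rough, renamed sketch of the idea - the
real enum is devlink_port_flavour in include/uapi/linux/devlink.h:

  /* Sketch only - names shortened, values omitted; see
   * include/uapi/linux/devlink.h for the real enum. */
  enum port_flavour_sketch {
          PORT_FLAVOUR_PHYSICAL,  /* uplink/physical port of the switch */
          PORT_FLAVOUR_PCI_PF,    /* switch-side end of the tunnel to a PF */
          PORT_FLAVOUR_PCI_VF,    /* switch-side end of the tunnel to a VF */
          PORT_FLAVOUR_PCI_SF,    /* switch-side end of the tunnel to a SF */
  };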

In all cases if I use a queue on a SF/VF to generate an ethernet
packet then that packet *always* goes into the tunnel to the
representor and goes into a switch. It is always contained by any
rules on the switch side. If the switch is set so the representor is
VLAN tagged then a queue on a SF/VF *cannot* escape the VLAN tag.

Similarly SF/VF cannot Rx any packets that are not sent into the
tunnel, meaning the switch controls what packets go into the
representor, through the tunnel and to the SF.

Yes, block and net traffic are all reduced to ethernet packets, sent
through the tunnel to the representor and treated by the switch. It is
no different than a physical switch. If there is to be some net/block
difference it has to be represented in the ethernet packets, eg with
vlan or something.

This is the fundamental security boundary of the architecture. The
SF/VF is a security domain and the only exchange of information from
that security domain to the hypervisor security domain is the tunnel
to the representor. The exchange across the boundary is only *packets*
not queues.

Essentially it exactly models the physical world. If I physically plug
in a NIC to a switch then the "representor" is the switch port in the
physical switch OS and the "SF/VF" is the NIC in the server.

The switch OS does not know or care what the NIC is doing. It does not
know or care if the NIC is doing VDPA, or if the packets are "block"
or "net" - they are all just packets by the time it gets to switching.

> Also I'm confused how block device can bind to mlx5_core - in that case
> I'm assuming the QP is bound 1:1 with a QP on the SmartNIC side, and
> that QP is plugged into an appropriate backend?

Every mlx5_core is a full multi-queue instance. It can have a huge
number of queues with no problems. Do not focus on the
queues; *queues* are irrelevant here.

Queues always have two ends. In this model one end is at the CPU and
the other is just ethernet packets. The purpose of the queue is to
convert CPU stuff into ethernet packets and vice versa. A mlx5 device
has a wide range of accelerators that can do all sorts of
transformations between CPU and packets built into the queues.

A queue can only be attached to a single mlx5_core, meaning all the
ethernet packets the queue sources/sinks must come from the PF/SF/VF
port. For SF/VF this port is connected to a tunnel to a representor to
the switch. Thus every queue has its packet side connected to the
switch.

However, the *queue* is an opaque detail of how the ethernet packets
are created from CPU data.

It doesn't matter if the queue is running VDPA, RDMA, netdev, or block
traffic - all of these things inherently result in ethernet packets,
and the hypervisor can't tell how the packet was created.

The architecture is *not* like virtio. virtio queues are individual
tunnels between hypervisor and guest.

This is the key detail: A VDPA queue is *not a tunnel*. It is an engine
to convert CPU data in virtio-net format to ethernet packets and
deliver those packets to the SF/VF end of the tunnel to the representor
and then to the switch. The tunnel is the SF/VF and representor
pairing, NOT the VDPA queue.

Looking at the logical life of a Tx packet from a VM doing VDPA:
 - VM's netdev builds the skb and writes a virtio-net formatted
   descriptor to a send queue
 - VM triggers a doorbell via a write to a BAR. In mlx5 this write goes
   to the device - qemu mmaps part of the device BAR to the guest (see
   the doorbell sketch after this walkthrough)
 - The HW begins processing a queue. The queue is in virtio-net format
   so it fetches the descriptor and now has the skb data
 - The HW forms the skb into an ethernet packet and delivers it to the
   representor through the tunnel, which immediately sends it to the
   HW switch. The VDPA QP in the SF/VF is now done.

 - In the switch the HW determines the packet is an exception. It
   applies RSS rules/etc and dynamically identifies on a per-packet
   basis what hypervisor queue the packet should be delivered to.
   This queue is in the hypervisor, and is in mlx5 native format.
 - The chosen hypervisor queue receives this packet and begins
   processing. It gets a receive buffer, writes the packet, and
   triggers an interrupt. This queue is now done.

 - hypervisor netdev now has the packet. It does the exception path
   in netdev and puts the SKB back on another queue for TX to the
   physical port. This queue is in mlx5 native format; the packet goes
   to the physical port.
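
To make the doorbell step concrete: because qemu maps a piece of the
SF/VF BAR straight into the guest, ringing the doorbell is a plain
MMIO store with no exit to the hypervisor. A minimal hypothetical
sketch - the register layout here is illustrative, not mlx5's actual
BAR layout:

  #include <stdint.h>

  /* db: a doorbell register inside the device BAR page that qemu
   * mapped into the guest. Writing the queue index tells the HW to
   * start fetching the virtio-net descriptors that were just posted. */
  static inline void ring_doorbell(volatile uint32_t *db, uint16_t vq_idx)
  {
          *db = vq_idx;   /* MMIO store goes straight to the device */
  }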

It traversed three queues. The HW dynamically selected the hypervisor
queue the VDPA packet is delivered to based *entirely* on switch
rules. The originating queue only informs the switch of what SF/VF
(and thus switch port) generated the packet.

At no point does the hypervisor know the packet originated from a VDPA
QP.

The RX side is similar: each PF/SF/VF port has a selector that
chooses which queue each packet goes to. That choice determines how
the packet is converted for the CPU. Each PF/SF/VF can have a huge
number of selectors, and a SF/VF sources its packets from the logical
tunnel attached to a representor which receives packets from the switch.

The selector is how the cross-subsystem sharing of the ethernet port
works, regardless of PF/SF/VF.

Again the hypervisor side has *no idea* what queue the packet will be
selected to when it delivers the packet to the representor side of the
tunnel.

Jason


