On Wed, Nov 18, 2020 at 10:22:51PM -0800, Saeed Mahameed wrote:
> > I think the biggest missing piece in my understanding is what's the
> > technical difference between an SF and a VDPA device.
>
> Same difference as between a VF and netdev.
> SF == VF, so a full HW function.
> VDPA/RDMA/netdev/SCSI/nvme/etc.. are just interfaces (ULPs) sharing the
> same functions as always been, nothing new about this.

All the implementation details are very different, but this white paper
from Intel goes into some detail on the basic elements and rationale for
the SF concept:

https://software.intel.com/content/dam/develop/public/us/en/documents/intel-scalable-io-virtualization-technical-specification.pdf

What we are calling a sub-function here is a close cousin to what Intel
calls an Assignable Device Interface. I expect to see other drivers
following this general pattern eventually.

A SF will eventually be assignable to a VM, and the VM won't be able to
tell whether a VF or a SF is providing the assignable PCI resources.

VDPA is also assignable to a guest, but the key difference between
mlx5's SF and VDPA is which guest driver binds to the virtual PCI
function. For a SF the guest will bind mlx5_core, for VDPA the guest
will bind virtio-net.

So, the driver stack for a VM using VDPA might be:

  Physical device [pci] -> mlx5_core -> [aux] -> SF -> [aux] ->
  mlx5_core -> [aux] -> mlx5_vdpa -> QEMU -> |VM| -> [pci] -> virtio_net

When Parav is talking about creating VDPA devices he means attaching the
VDPA accelerator subsystem to a mlx5_core, wherever that mlx5_core
happens to be attached.

To your other remark:

> > What are you NAK'ing?
>
> Spawning multiple netdevs from one device by slicing up its queues.

This is a bit vague. In SRIOV a device spawns multiple netdevs for a
physical port by "slicing up its physical queues" - where do you see the
crossover between VMDq (bad) and SRIOV (ok)?

I thought the issue with VMDq was more about the horrid management
needed to configure the traffic splitting, not the actual splitting
itself?

In classic SRIOV the traffic is split by a simple, non-configurable HW
switch based on the MAC address of the VF.

mlx5 already has the extended version of that idea: we can run in
switchdev mode and use switchdev to configure the HW switch, so
configurable switchdev rules split the traffic for VFs.

This SF step replaces the VF in the above, but everything else is the
same. The switchdev still splits the traffic, it still ends up in the
same nested netdev queue structure & RSS a VF/PF would use, etc, etc. No
queues are "stolen" to create the nested netdev.

From the driver perspective there is no significant difference between
sticking a netdev on a mlx5 VF and sticking a netdev on a mlx5 SF. A SF
netdev is not going in and doing deep surgery on the PF netdev to steal
queues or something.

Both VF and SF will eventually be assignable to guests, both can support
all the accelerator subsystems - VDPA, RDMA, etc. Both can support
netdev.

Compared to VMDq, I think there is really no comparison. SF/ADI is an
evolution of a SRIOV VF from something PCI-SIG controlled to something
device-specific and lighter weight. SF/ADI come with an architectural
security boundary suitable for assignment to an untrusted guest. It is
not just a jumble of queues. VMDq is .. not that.

Actually this has been one of the open debates in the virtualization
userspace world. The approach of using switchdev to control the traffic
splitting to VMs is elegant, but many drivers are not following this
design. :(
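Going back to the driver stack above, to make the "[aux]" hops a bit
more concrete: a ULP (netdev, RDMA, VDPA, ...) is just an auxiliary-bus
driver that binds to whatever auxiliary device the core driver
registers, and it is the same code whether that core instance sits on a
PF, a VF or a SF. A minimal, untested sketch - the "mlx5_core.sf" match
string and the my_sf_* names are placeholders, not the real mlx5 code:

#include <linux/auxiliary_bus.h>
#include <linux/module.h>

/* Probed once for each matching auxiliary device the core driver
 * registers; the flow is the same no matter what kind of function the
 * core instance is running on.
 */
static int my_sf_probe(struct auxiliary_device *adev,
		       const struct auxiliary_device_id *id)
{
	/* Look up the core device behind adev and register a netdev
	 * (or an RDMA/VDPA device) on top of it.
	 */
	return 0;
}

static void my_sf_remove(struct auxiliary_device *adev)
{
	/* Tear down whatever probe registered. */
}

static const struct auxiliary_device_id my_sf_id_table[] = {
	{ .name = "mlx5_core.sf" },	/* placeholder match string */
	{}
};
MODULE_DEVICE_TABLE(auxiliary, my_sf_id_table);

static struct auxiliary_driver my_sf_driver = {
	.name = "my_sf_ulp",
	.probe = my_sf_probe,
	.remove = my_sf_remove,
	.id_table = my_sf_id_table,
};
module_auxiliary_driver(my_sf_driver);

MODULE_LICENSE("GPL");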
Finally, in the mlx5 model VDPA is just an "application". It asks the
device to create a 'RDMA' raw ethernet packet QP whose rings are laid
out according to the virtio-net specification. We can create it in the
kernel using mlx5_vdpa, and we can create it in userspace through the
RDMA subsystem.

Like any "RDMA" application it is contained by the security boundary of
the PF/VF/SF the mlx5_core is running on.
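For the userspace path, the generic verbs API already exposes raw
ethernet packet QPs; a rough sketch of just that piece (the
device-specific bits mlx5_vdpa layers on top are not shown, error
handling is trimmed, and CAP_NET_RAW is required):

#include <infiniband/verbs.h>

/* Create a raw ethernet packet QP: the application sends and receives
 * complete ethernet frames and owns their contents.  The real VDPA
 * object additionally asks the device, via interfaces not shown here,
 * to consume rings formed to the virtio-net spec.
 */
static struct ibv_qp *create_raw_eth_qp(struct ibv_pd *pd,
					struct ibv_cq *cq)
{
	struct ibv_qp_init_attr attr = {
		.send_cq = cq,
		.recv_cq = cq,
		.cap = {
			.max_send_wr = 256,
			.max_recv_wr = 256,
			.max_send_sge = 1,
			.max_recv_sge = 1,
		},
		.qp_type = IBV_QPT_RAW_PACKET,
	};

	return ibv_create_qp(pd, &attr);
}

Jason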