Re: [RFC 0/4] pci/sriov: support VFs dynamic addition

On 2022/11/14 22:20, Leon Romanovsky wrote:
On Mon, Nov 14, 2022 at 10:06:49PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:


On 2022/11/14 21:09, Leon Romanovsky wrote:
On Mon, Nov 14, 2022 at 08:38:42PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:


On 2022/11/14 15:04, Leon Romanovsky wrote:
On Sun, Nov 13, 2022 at 09:47:12PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
Hi leon,

On 2022/11/12 0:39, Leon Romanovsky wrote:
On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote:
From: Longpeng <longpeng2@xxxxxxxxxx>

We can enable SR-IOV and add VFs via /sys/bus/pci/devices/..../sriov_numvfs, but
this operation takes a lot of time when there is a large number of VFs.
For example, if the machine has 10 PFs and 250 VFs per PF, enabling all the VFs
concurrently costs about 200-250ms. However, most of them are not needed at that
moment, so we could enable SR-IOV first but add the VFs on demand.

It is unclear what took 200-250ms: is it physical VF creation or binding of
the driver to these VFs?

It is neither. In our test, we had already created the physical VFs beforehand, so we
skipped the 100ms wait when writing PCI_SRIOV_CTRL. And our driver only
probes the PF; it just returns an error if the function is a VF.

It means that you didn't try sriov_drivers_autoprobe. Once it is set to
false, the kernel won't even try to probe the VFs.
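
For reference, the standard flow discussed here looks roughly like the following
(a minimal userspace sketch; the PF address 0000:3b:00.0 is a placeholder):

	/* Minimal sketch of the standard SR-IOV enable flow via sysfs.
	 * The PF address below is a placeholder.
	 */
	#include <stdio.h>

	static int write_sysfs(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");

		if (!f)
			return -1;
		fputs(val, f);
		return fclose(f);
	}

	int main(void)
	{
		const char *pf = "/sys/bus/pci/devices/0000:3b:00.0";
		char path[256];

		/* Keep the kernel from auto-probing a driver on each new VF. */
		snprintf(path, sizeof(path), "%s/sriov_drivers_autoprobe", pf);
		write_sysfs(path, "0");

		/* Create all VFs in one shot; this is the step whose cost is discussed above. */
		snprintf(path, sizeof(path), "%s/sriov_numvfs", pf);
		write_sysfs(path, "250");

		return 0;
	}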


The hotspot is sriov_add_vfs (with no driver probe involved, in fact), which is a
long procedure. Each step costs only a little, but the total cost is not
acceptable in some time-sensitive cases.
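
For context, the path in question is essentially a per-VF loop, roughly like this
(a simplified sketch, not the exact drivers/pci/iov.c source):

	#include <linux/pci.h>

	/* Simplified outline of the per-PF VF addition loop: each iteration
	 * sets up and registers one VF, so the cost grows linearly with
	 * num_vfs, and the VFs come up strictly in index order.
	 */
	static int sriov_add_vfs_sketch(struct pci_dev *dev, u16 num_vfs)
	{
		unsigned int i;
		int rc;

		for (i = 0; i < num_vfs; i++) {
			rc = pci_iov_add_virtfn(dev, i);
			if (rc)
				goto failed;
		}
		return 0;

	failed:
		while (i--)
			pci_iov_remove_virtfn(dev, i);
		return rc;
	}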

This is also cryptic to me. In a standard SR-IOV deployment, all VFs are
created and configured when the operator boots the machine with sriov_drivers_autoprobe
set to false. Once the machine is ready, VFs are assigned to the relevant VMs/users
through orchestration SW (IMHO, this is supported by all orchestration SW).

And only the last part (assigning to users) is a time-sensitive operation.

VF creation and configuration are also time-sensitive in some cases, for
example the hypervisor live update case (such as [1]):
   save VMs -> kexec -> restore VMs

After the new kernel starts, the VFs must be added to the system, and then
the original VFs assigned back to QEMU. This means we must enable all 2K+ VFs
at once, which increases the downtime.

If we could enable only the VFs used by the existing VMs, then restore the VMs,
and enable the other unused VFs last, the downtime would be significantly
reduced.

[1] https://static.sched.com/hosted_files/kvmforum2022/65/kvmforum2022-Preserving%20IOMMU%20states%20during%20kexec%20reboot-v4.pdf

As written in the presentation, the standard way of doing this is the
VFIO live migration feature, where the 2K+ VMs are migrated to another server
when the first server is scheduled for maintenance.

Live migration is not the best choice in a production environment; it's too
heavy. Some cloud providers prefer to use hypervisor live update in their
systems, such as AWS's Nitro hypervisor.

How is AWS Nitro relevant to our discussion about adding a sysfs file to Linux?
Can you please point us to the source code of that hypervisor? Does it even
run on Linux?

Um... you can google for more information about the AWS Nitro system.

Yes, it's a digression, so let's get back to the discussion about adding the sysfs file.

Anyway, I'm aware of big cloud providers who are pretty happy with live
migration in production.

We're having trouble coming to an agreement on this point, but it doesn't matter. Please see below.


However, even in the live update case mentioned in the presentation, you
would disable ALL PFs/VFs and enable ALL PFs/VFs at the same time,
so you don't need a per-VF-id enable knob.

The presentation is just a reference; some points could be optimized,
including disabling and enabling the PFs/VFs.

Hypervisor live update can finish in less than 1 second, so the cost of
disabling PFs/VFs and enabling PFs/VFs (~200-250ms or even worse) is too
high.



What's more, sriov_add_vfs adds the VFs of a PF one by one, so we can
have at most 10 concurrent calls if there are 10 PFs.
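
In other words, the most userspace can parallelize today is one sriov_numvfs write
per PF, e.g. one thread per PF as sketched below (the PF addresses are placeholders);
within each PF the addition stays serial:

	#include <pthread.h>
	#include <stdio.h>

	/* Hypothetical sketch: enable VFs on several PFs concurrently by
	 * writing each PF's sriov_numvfs from its own thread.
	 */
	static void *enable_pf(void *arg)
	{
		const char *pf = arg;
		char path[256];
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/bus/pci/devices/%s/sriov_numvfs", pf);
		f = fopen(path, "w");
		if (f) {
			fputs("250", f);	/* serial VF addition happens inside this write */
			fclose(f);
		}
		return NULL;
	}

	int main(void)
	{
		const char *pfs[] = { "0000:3b:00.0", "0000:5e:00.0" };	/* ... up to 10 PFs */
		pthread_t tid[2];
		int i;

		for (i = 0; i < 2; i++)
			pthread_create(&tid[i], NULL, enable_pf, (void *)pfs[i]);
		for (i = 0; i < 2; i++)
			pthread_join(tid[i], NULL);
		return 0;
	}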

I wondered, are you using real HW or QEMU SR-IOV? What is your server
that supports such a large number of VFs?

A physical device. Some devices on the market support a large number of VFs,
especially in the hardware offloading area, e.g. DPU/IPU. I think the SR-IOV
software should keep pace with the times too.

Our devices (and Intel's too) support many VFs as well. The thing is that
servers are unlikely to be able to support 10 physical devices with 2K+
VFs. There are many limitations that make such a setup unusable,
like the global MSI-X pool and the PCI bandwidth needed to support all these devices.


BTW, your change will probably break all SR-IOV devices on the market, as
they rely on the PCI subsystem to have the VFs ready and configured.

I see, but maybe this change could be an option for some users.

It should come with the relevant driver changes and a very strong justification for why
such functionality is needed now and can't be achieved by anything else
except a user-facing sysfs file.

Adding 2K+ VFs to sysfs needs too much time.

Look at the bottom half of the hypervisor live update:
kexec --> add 2K VFs --> restore VMs

The downtime can be reduced if the sequence is:
kexec --> add 100 VFs (the ones the VMs use) --> restore VMs --> add the remaining 1.9K VFs
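
Purely as an illustration of that ordering ("sriov_vf_add" below is only a placeholder
name, not necessarily the interface in these patches, and the PF address is also a
placeholder):

	#include <stdio.h>

	/* Hypothetical illustration only: "sriov_vf_add" is a placeholder
	 * knob. It shows where per-VF addition would sit relative to the
	 * VM restore step.
	 */
	static void add_vf(const char *pf, int vf_id)
	{
		char path[256];
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/bus/pci/devices/%s/sriov_vf_add", pf);
		f = fopen(path, "w");
		if (f) {
			fprintf(f, "%d", vf_id);
			fclose(f);
		}
	}

	int main(void)
	{
		const char *pf = "0000:3b:00.0";	/* placeholder PF */
		int i;

		for (i = 0; i < 100; i++)		/* only the VFs the saved VMs use */
			add_vf(pf, i);
		/* ... restore the VMs here; the downtime window ends ... */
		for (i = 100; i < 250; i++)		/* the rest, off the critical path */
			add_vf(pf, i);
		return 0;
	}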

Addition of VFs is a serial operation; you can fire up your VMs once you
have counted 100 VFs in the sysfs directory.

With the current implementation, the VFs must be added in order, so this cannot work properly.

For example, a VM may use VF200, VF202, and VF204, but sriov_add_vfs can only add VFs in the order VF0, VF1, VF2 ... The limitation is introduced by the software, not by the PCI spec.
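
To make the limitation concrete: today userspace can only wait for a specific
virtfn<N> link to appear under the PF, and that link appears only after every
lower-numbered VF has been added (a rough sketch; the PF address is a placeholder):

	#include <stdio.h>
	#include <unistd.h>

	/* Rough sketch: poll for a specific VF's virtfn<N> symlink under
	 * the PF. Because VFs are added in index order, VF200 only shows
	 * up after VF0..VF199 already exist.
	 */
	static int vf_present(const char *pf, int vf_id)
	{
		char path[256];

		snprintf(path, sizeof(path),
			 "/sys/bus/pci/devices/%s/virtfn%d", pf, vf_id);
		return access(path, F_OK) == 0;
	}

	int main(void)
	{
		const char *pf = "0000:3b:00.0";	/* placeholder PF */

		while (!vf_present(pf, 200))
			usleep(1000);
		printf("VF200 is ready\n");
		return 0;
	}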



I don't see anything in this presentation and discussion that supports
the need for such a UAPI.

Thanks


Thanks