On Tue, Nov 15, 2022 at 1:27 AM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
>
> *snip*
>
> Anyway, I'm aware of big cloud providers who are pretty happy with live
> migration in production.

I could see someone sufficiently cloudbrained deciding that rebooting the
hypervisor is fine provided the downtime doesn't violate any customer
uptime SLAs. Personally I'd only be brave enough to do that for a HV
hosting internal services which I know are behind a load balancer, but
apparently there are people at Huawei far braver than I.

> *snip*
>
> > Adding 2K+ VFs to sysfs needs too much time.
> >
> > Look at the bottom half of the hypervisor live update:
> > kexec --> add 2K VFs --> restore VMs
> >
> > The downtime can be reduced if the sequence is:
> > kexec --> add 100 VFs (the ones the VMs use) --> restore VMs --> add 1.9K VFs
>
> Addition of VFs is a serial operation; you can fire up your VMs once you
> have counted 100 VFs in the sysfs directory.

I don't know if making that kind of assumption about the behaviour of
sysfs is better or worse than just adding another knob. If at some point
in the future the initialisation of VF pci_devs was moved to a workqueue
or something, we'd be violating that assumption without breaking any of
the documented ABI. I guess you could argue that VFs being added
sequentially is "ABI", but userspace has always been told not to make
assumptions about when sysfs attributes (or nodes, I guess) appear,
since doing so is prone to races.
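For what it's worth, the "count VFs in sysfs" approach Leon describes could be sketched roughly like this. This is only an illustration of the assumption under discussion, not something from the thread: `count_vfs`, `wait_for_vfs`, the PF address, and the poll interval are all made up, and the only thing it relies on is the standard `virtfn*` symlink layout under a PF's sysfs directory.

```shell
# Hypothetical helper: count the VF entries (virtfn0, virtfn1, ...)
# that the PCI core creates under a PF's sysfs directory.
count_vfs() {
    ls -d "$1"/virtfn* 2>/dev/null | wc -l
}

# Usage sketch: poll until enough VFs have appeared, then restore VMs.
# This is exactly the racy assumption criticised above -- it only works
# if VF pci_devs keep being registered serially and synchronously.
wait_for_vfs() {
    pf=$1
    needed=$2
    while [ "$(count_vfs "$pf")" -lt "$needed" ]; do
        sleep 0.1
    done
}

# e.g. (PF address is a made-up example):
#   wait_for_vfs /sys/bus/pci/devices/0000:03:00.0 100 && restore_vms
```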