On 12/7/2022 9:54 AM, Christoph Hellwig wrote:
On Tue, Dec 06, 2022 at 03:15:41PM -0400, Jason Gunthorpe wrote:
What the kernel is doing is providing the abstraction to link the
controlling function to the VFIO device in a general way.
We don't want to just punt this problem to user space and say 'good
luck finding the right cdev for migration control'. If the kernel
struggles to link them then userspace will not fare better on its own.
Yes. But the right interface for that is to issue the userspace
commands for anything that is not normal PCIe function level
to the controlling function, and to discover the controlled functions
based on the controlling functions.
In other words: there should be absolutely no need to have any
special kernel support for the controlled function. Instead the
controlling function enumerates all the functions it controls, exports
that to userspace, and exposes the functionality to save state from
and restore state to the controlled functions.
Why is it preferred that the migration SW talk directly to the PF
rather than go through the VFIO interface?
It's just an implementation detail.
I feel it is even more reasonable to have a common API, like the one we
have today (save_state/resume_state/quiesce_device/freeze_device), and
have each device implementation translate this functionality to its
own spec.
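To illustrate the common-API idea, here is a minimal sketch in plain
userspace C. Everything in it is an assumption made for illustration:
struct lm_ops, struct lm_device, the two backends and the command
strings are hypothetical and are not the kernel's vfio structures. It
only shows generic migration code calling one ops table while each
backend translates the call into its own device-specific command.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct lm_device;

/* Hypothetical common API: one entry per operation named in the mail. */
struct lm_ops {
    int (*quiesce_device)(struct lm_device *dev);
    int (*freeze_device)(struct lm_device *dev);
    int (*save_state)(struct lm_device *dev, char *buf, size_t len);
    int (*resume_state)(struct lm_device *dev, const char *buf, size_t len);
};

struct lm_device {
    const struct lm_ops *ops;
    char last_cmd[48]; /* last device-specific command, kept for illustration */
};

/* Stand-in for issuing a device command; a real driver would talk to HW. */
static int issue(struct lm_device *dev, const char *cmd)
{
    snprintf(dev->last_cmd, sizeof(dev->last_cmd), "%s", cmd);
    return 0;
}

/* Made-up protocol backend: operations become admin commands. */
static int nvme_quiesce(struct lm_device *d) { return issue(d, "nvme-admin: suspend"); }
static int nvme_freeze(struct lm_device *d) { return issue(d, "nvme-admin: freeze"); }
static int nvme_save(struct lm_device *d, char *buf, size_t len)
{
    snprintf(buf, len, "nvme-state");
    return issue(d, "nvme-admin: get-state");
}
static int nvme_resume(struct lm_device *d, const char *buf, size_t len)
{
    (void)buf; (void)len;
    return issue(d, "nvme-admin: set-state");
}

/* Made-up vendor backend: the same operations become firmware commands. */
static int vend_quiesce(struct lm_device *d) { return issue(d, "fw-cmd: quiesce"); }
static int vend_freeze(struct lm_device *d) { return issue(d, "fw-cmd: freeze"); }
static int vend_save(struct lm_device *d, char *buf, size_t len)
{
    snprintf(buf, len, "vendor-state");
    return issue(d, "fw-cmd: save");
}
static int vend_resume(struct lm_device *d, const char *buf, size_t len)
{
    (void)buf; (void)len;
    return issue(d, "fw-cmd: load");
}

const struct lm_ops nvme_lm_ops = {
    .quiesce_device = nvme_quiesce, .freeze_device = nvme_freeze,
    .save_state = nvme_save, .resume_state = nvme_resume,
};
const struct lm_ops vendor_lm_ops = {
    .quiesce_device = vend_quiesce, .freeze_device = vend_freeze,
    .save_state = vend_save, .resume_state = vend_resume,
};

/* Generic migration SW: the same call sequence for every device type. */
int lm_stop_and_save(struct lm_device *dev, char *buf, size_t len)
{
    int ret;

    assert(dev != NULL && dev->ops != NULL);
    ret = dev->ops->quiesce_device(dev);
    if (ret)
        return ret;
    ret = dev->ops->freeze_device(dev);
    if (ret)
        return ret;
    return dev->ops->save_state(dev, buf, len);
}
```

The caller of lm_stop_and_save() never needs to know which backend sits
behind the device, which is the property I would like to keep.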
If I understand correctly, your direction is to have QEMU code talk to
nvmecli/new_mlx5cli/my_device_cli to do that, and I'm not sure that is
needed.
The controlled device is not aware of any of the migration process. Only
the migration SW, the system admin and the controlling device are.
I see 2 orthogonal discussions here: NVMe standardization for LM and
Linux implementation for LM.
For the NVMe standardization: I think we all agree, at a high level, that
the primary controller manages the LM of the secondary controllers. The
primary controller can list the secondary controllers. The primary
controller exposes APIs through its admin_queue to manage the LM process
of its secondary controllers. LM capabilities will be exposed through the
identify_ctrl admin cmd of the primary controller.
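As a rough model of that split, the sketch below has a primary
controller that owns a list of secondary controllers, advertises LM
support in a capability word, and accepts LM admin operations that
target one of its secondaries. All names (LM_CAP_SUPPORTED, struct
primary_ctrl, primary_lm_admin_cmd, the two ops) are hypothetical; no
such fields or commands are taken from the NVMe spec.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LM_CAP_SUPPORTED 0x1u /* hypothetical capability bit */

struct secondary_ctrl {
    uint16_t cntlid;  /* controller identifier of the secondary */
    bool suspended;   /* set/cleared only through the primary */
};

struct primary_ctrl {
    uint32_t lm_caps;            /* hypothetical LM field in identify data */
    struct secondary_ctrl *sec;  /* secondaries this primary manages */
    size_t nr_sec;
};

enum lm_admin_op { LM_OP_SUSPEND, LM_OP_RESUME };

/* The primary can list its secondaries, so lookup starts from the primary. */
struct secondary_ctrl *primary_find_secondary(struct primary_ctrl *p,
                                              uint16_t cntlid)
{
    for (size_t i = 0; i < p->nr_sec; i++)
        if (p->sec[i].cntlid == cntlid)
            return &p->sec[i];
    return NULL;
}

/* LM admin op sent to the primary, targeting one of its secondaries. */
int primary_lm_admin_cmd(struct primary_ctrl *p, uint16_t cntlid,
                         enum lm_admin_op op)
{
    struct secondary_ctrl *s;

    assert(p != NULL);
    if (!(p->lm_caps & LM_CAP_SUPPORTED))
        return -ENOTSUP;  /* primary does not advertise LM */
    s = primary_find_secondary(p, cntlid);
    if (!s)
        return -ENOENT;   /* not a secondary of this primary */
    s->suspended = (op == LM_OP_SUSPEND);
    return 0;
}
```

Note that the secondary controller has no LM entry point of its own in
this model; every operation goes through the primary.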
For the Linux implementation: the direction we started last year is to
have vendor-specific (mlx5/hisi/..) or protocol-specific
(nvme/virtio/..) vfio drivers. We built an infrastructure for that by
splitting the vfio_pci driver into vfio_pci and vfio_pci_core, and we
updated the uAPIs as well to support the P2P case for live migration.
Dirty page tracking is also progressing. More work is certainly still
needed to improve this infrastructure.
I hope that all of the above efforts will also be used for the NVMe LM
implementation, unless there is something NVMe-specific in the way of
migrating PCI functions that I can't see now.
If there is something NVMe-specific for LM, then the migration SW and
QEMU will need to be aware of it, and with that awareness we lose the
benefit of a generic VFIO interface.
Especially, we do not want every VFIO device to have its own crazy way
for userspace to link the controlling/controlled functions
together. This is something the kernel has to abstract away.
Yes. But the direction must go controlling to controlled, not the
other way around.
So in the source:
1. We enable SRIOV on the NVMe driver
2. We list all the secondary controllers: nvme1, nvme2, nvme3
3. We allow migrating nvme1, nvme2, nvme3 - now these VFs are migratable
(controlling to controlled).
4. We bind nvme1, nvme2, nvme3 to the VFIO NVMe driver
5. We pass these functions to the VM
6. We start the migration process.
And in the destination:
1. We enable SRIOV on the NVMe driver
2. We list all the secondary controllers: nvme1, nvme2, nvme3
3. We allow migration resume to nvme1, nvme2, nvme3 - now these VFs are
resumable (controlling to controlled).
4. We bind nvme1, nvme2, nvme3 to the VFIO NVMe driver
5. We pass these functions to the VM
6. We start the migration resume process.
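The ordering in the two lists above can be sketched as a tiny per-VF
state machine. This is only a model under my own assumptions: the enum
values, struct vf and vf_advance() are hypothetical names, and the
strict "no step may be skipped" rule is my reading of the flow rather
than an existing kernel check. The same sequence applies on the source
(migrate) and on the destination (resume).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* One value per stage of the flow; names are hypothetical. */
enum vf_step {
    VF_LISTED,      /* steps 1-2: SR-IOV enabled, VF listed by the PF */
    VF_LM_ALLOWED,  /* step 3: PF allowed migrate/resume for this VF */
    VF_BOUND,       /* step 4: VF bound to the VFIO NVMe driver */
    VF_ASSIGNED,    /* step 5: VF passed to the VM */
    VF_MIGRATING,   /* step 6: migration (or resume) started */
};

struct vf {
    enum vf_step step;
};

/*
 * Move a VF to the next stage. Only the immediately following stage is
 * accepted, so a VF cannot be bound to VFIO before the controlling
 * function has allowed migration for it.
 */
int vf_advance(struct vf *vf, enum vf_step next)
{
    assert(vf != NULL);
    if ((int)next != (int)vf->step + 1)
        return -EINVAL;
    vf->step = next;
    return 0;
}
```

For example, calling vf_advance(&vf, VF_BOUND) on a VF that is still in
VF_LISTED returns -EINVAL and leaves the VF where it was.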