Re: [PATCH v2 4/4] virtio-net: Add support for USO features

Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> · Thu, 1 Aug 2024 14:28:29 +0900

On 2024/08/01 11:28, Jason Wang wrote:
On Wed, Jul 31, 2024 at 8:58 PM Peter Xu <peterx@xxxxxxxxxx> wrote:

On Wed, Jul 31, 2024 at 03:41:00AM -0400, Michael S. Tsirkin wrote:
On Wed, Jul 31, 2024 at 08:04:24AM +0100, Daniel P. Berrangé wrote:
On Tue, Jul 30, 2024 at 05:32:48PM -0400, Michael S. Tsirkin wrote:
On Tue, Jul 30, 2024 at 04:03:53PM -0400, Peter Xu wrote:
On Tue, Jul 30, 2024 at 03:22:50PM -0400, Michael S. Tsirkin wrote:
This is not what we did historically. Why should we start now?

It's a matter of whether we still want migration to randomly fail, like
what this patch does.

Or any better suggestions?  I'm definitely open to that.

Thanks,

--
Peter Xu

Randomly is an overstatement. You need to switch between kernels
where this feature differs. We did it with a ton of features
in the past, dunno why we single out USO now.

This has been a problem with a ton of features in the past. We've
ignored the problem, but that doesn't make it the right solution.

With regards,
Daniel

Pushing it to domain xml does not really help,
migration will still fail unexpectedly (after wasting
a ton of resources copying memory, and getting
a downtime bump, I might add).

Could you elaborate on why it would fail with what I proposed?

Note that if this is a generic comment about "any migration can fail if we
found a device mismatch", we have a plan to fix that to some degree. It's
just that we don't have enough people working on these topics yet. See:

https://wiki.qemu.org/ToDo/LiveMigration#Migration_handshake

It includes:

  "Check device tree on both sides, etc., to make sure the migration is
   applicable. E.g., we should fail early and clearly on any device
   mismatch."

However I don't think it'll cover all checks, e.g. I _think_ even if we
verify VMSDs then post_load() hooks can still fail, and there can be some
corner cases to think about.  And of course, this may not even apply to virtio
since virtio manages migration itself, without providing a top-level vmsd.

The right solution is to have a tool that can query
backends, and that given the results from all of the cluster,
generate a set of parameters that will ensure migration works.

This seems to be very hard for vhost-users.

Can you elaborate more? I was thinking of something like the following:
1. Prepare a QEMU command line.
2. Run the command line appended with -dump-platform on all hosts, which
dumps the platform features that are automatically enabled. For virtio
devices, we can dump the "host_features" variable.
3. Run the command line appended with -merge-platform with all dumps.
For most virtio devices, this would be an AND operation on the
"host_features" variables.
4. Run the command line appended with -use-platform with the merged
dump. This will run VMs with only the features available on all hosts.
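
To make step 3 concrete, here is a rough sketch in Python. It assumes each
dump is simply a mapping from a device id to a 64-bit "host_features" mask.
The dump format, the function name merge_platform() and the device id "net0"
are all hypothetical; none of this exists in QEMU today. The two bit
positions are the ones I believe the virtio-net headers use and are only
there for illustration.

```python
from functools import reduce

# Illustrative virtio-net feature bits (assumed positions, see note above).
VIRTIO_NET_F_CSUM = 1 << 0
VIRTIO_NET_F_HOST_USO = 1 << 56


def merge_platform(dumps):
    """AND the host_features mask of each device across all host dumps.

    `dumps` is a list of dicts, one per host, mapping a device id to the
    host_features bitmask that host would enable automatically.
    """
    if not dumps:
        raise ValueError("need at least one dump")

    device_ids = set(dumps[0])
    for dump in dumps[1:]:
        # All dumps come from the same command line, so the device set
        # must match; anything else indicates a mistake upstream.
        if set(dump) != device_ids:
            raise ValueError("dumps describe different device sets")

    # A feature bit survives only if every host offers it.
    return {
        dev: reduce(lambda a, b: a & b, (dump[dev] for dump in dumps))
        for dev in device_ids
    }


# One host with a kernel that offers USO, one without.
new_host = {"net0": VIRTIO_NET_F_CSUM | VIRTIO_NET_F_HOST_USO}
old_host = {"net0": VIRTIO_NET_F_CSUM}

merged = merge_platform([new_host, old_host])
print({dev: hex(mask) for dev, mask in merged.items()})
```

Here USO is dropped from the merged mask because the older host lacks it,
so a VM started with the merged dump could migrate in either direction.
Devices whose capabilities are not a plain bitmask would need their own
merge rule instead of a simple AND.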

I may have missed something, but this seems good enough to me. Of course,
this requires changes throughout the stack (QEMU common and
device-specific code, libvirt, and even higher layers like OpenStack).

Regards,
Akihiko Odaki