Re: [PATCH v2 4/4] virtio-net: Add support for USO features

Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> · Tue, 30 Jul 2024 12:28:56 +0900

On 2024/07/30 12:17, Jason Wang wrote:
On Tue, Jul 30, 2024 at 11:12 AM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:

On 2024/07/30 12:03, Jason Wang wrote:
On Tue, Jul 30, 2024 at 10:57 AM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:

On 2024/07/30 11:04, Jason Wang wrote:
On Tue, Jul 30, 2024 at 12:43 AM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:

On 2024/07/29 23:29, Peter Xu wrote:
On Mon, Jul 29, 2024 at 01:45:12PM +0900, Akihiko Odaki wrote:
On 2024/07/29 12:50, Jason Wang wrote:
On Sun, Jul 28, 2024 at 11:19 PM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:

On 2024/07/27 5:47, Peter Xu wrote:
On Fri, Jul 26, 2024 at 04:17:12PM +0100, Daniel P. Berrangé wrote:
On Fri, Jul 26, 2024 at 10:43:42AM -0400, Peter Xu wrote:
On Fri, Jul 26, 2024 at 09:48:02AM +0100, Daniel P. Berrangé wrote:
On Fri, Jul 26, 2024 at 09:03:24AM +0200, Thomas Huth wrote:
On 26/07/2024 08.08, Michael S. Tsirkin wrote:
On Thu, Jul 25, 2024 at 06:18:20PM -0400, Peter Xu wrote:
On Tue, Aug 01, 2023 at 01:31:48AM +0300, Yuri Benditovich wrote:
USO features of virtio-net device depend on kernel ability
to support them, for backward compatibility by default the
features are disabled on 8.0 and earlier.

Signed-off-by: Yuri Benditovich <yuri.benditovich@xxxxxxxxxx>
Signed-off-by: Andrew Melnychecnko <andrew@xxxxxxxxxx>

Looks like this patch broke migration when the VM starts on a host that has
USO supported, to another host that doesn't..

This was always the case with all offloads. The answer at the moment is,
don't do this.

May I ask for my understanding:
"don't do this" = don't automatically enable/disable virtio features in QEMU
depending on host kernel features, or "don't do this" = don't try to migrate
between machines that have different host kernel features?

Long term, we need to start exposing management APIs
to discover this, and management has to disable unsupported features.

Ack, this likely needs some treatments from the libvirt side, too.

When QEMU automatically toggles machine type featuers based on host
kernel, relying on libvirt to then disable them again is impractical,
as we cannot assume that the libvirt people are using knows about
newly introduced features. Even if libvirt is updated to know about
it, people can easily be using a previous libvirt release.

QEMU itself needs to make the machine types do that they are there
todo, which is to define a stable machine ABI.

What QEMU is missing here is a "platform ABI" concept, to encode
sets of features which are tied to specific platform generations.
As long as we don't have that we'll keep having these broken
migration problems from machine types dynamically changing instead
of providing a stable guest ABI.

Any more elaboration on this idea?  Would it be easily feasible in
implementation?

In terms of launching QEMU I'd imagine:

        $QEMU -machine pc-q35-9.1 -platform linux-6.9 ...args...

Any virtual machine HW features which are tied to host kernel features
would have their defaults set based on the requested -platform. The
-machine will be fully invariant wrt the host kernel.

You would have -platform hlep to list available platforms, and
corresonding QMP "query-platforms" command to list what platforms
are supported on a given host OS.

Downstream distros can provide their own platforms definitions
(eg "linux-rhel-9.5") if they have kernels whose feature set
diverges from upstream due to backports.

Mgmt apps won't need to be taught about every single little QEMU
setting whose default is derived from the kernel. Individual
defaults are opaque and controlled by the requested platform.

Live migration has clearly defined semantics, and mgmt app can
use query-platforms to validate two hosts are compatible.

Omitting -platform should pick the very latest platform that is
cmpatible with the current host (not neccessarily the latest
platform built-in to QEMU).

This seems to add one more layer to maintain, and so far I don't know
whether it's a must.

To put it simple, can we simply rely on qemu cmdline as "the guest ABI"?  I
thought it was mostly the case already, except some extremely rare
outliers.

When we have one host that boots up a VM using:

        $QEMU1 $cmdline

Then another host boots up:

        $QEMU2 $cmdline -incoming XXX

Then migration should succeed if $cmdline is exactly the same, and the VM
can boot up all fine without errors on both sides.

AFAICT this has nothing to do with what kernel is underneath, even not
Linux?  I think either QEMU1 / QEMU2 has the option to fail.  But if it
didn't, I thought the ABI should be guaranteed.

That's why I think this is a migration violation, as 99.99% of other device
properties should be following this rule.  The issue here is, we have the
same virtio-net-pci cmdline on both sides in this case, but the ABI got
break.

That's also why I was suggesting if the property contributes to the guest
ABI, then AFAIU QEMU needs to:

        - Firstly, never quietly flipping any bit that affects the ABI...

        - Have a default value of off, then QEMU will always allow the VM to boot
          by default, while advanced users can opt-in on new features.  We can't
          make this ON by default otherwise some VMs can already fail to boot,

It may not be necessary the case that old features are supported by
every systems. In an extreme case, a user may migrate a VM from Linux to
Windows, which probably doesn't support any offloading at all. A more
convincing scenario is RSS offloading with eBPF; using eBPF requires a
privilege so we cannot assume it is always available even on the latest
version of Linux.

I don't get why eBPF matters here. It is something that is not noticed
by the guest and we have a fallback anyhow.

It is noticeable for the guest, and the fallback is not effective with
vhost.

It's a bug then. Qemu can fallback to tuntap if it sees issues in vhost.

We can certainly fallback to in-QEMU RSS by disabling vhost, but I would
not say lack of such fallback is a bug.

Such fallback is by design since the introduction of vhost.

We don't provide in-QEMU
fallback for other offloads.

Yes but what I want to say is that eBPF RSS is different from those
segmentation offloads. And technically, Qemu can do fallback for
offloads (as RSC did).

Well, I couldn't find any code disabling vhost for the in-QEMU RSC
implementation.

It should be a bug (and I remember we disabled vhost when the patches
were merged). Have you tested it in a guest to see if it can see RSC
when vhost is enabled?

I suspect we need to add the RSC bit into current kernel_feature_bits:

/* Features supported by host kernel. */
static const int kernel_feature_bits[] = {
     VIRTIO_F_NOTIFY_ON_EMPTY,
     VIRTIO_RING_F_INDIRECT_DESC,
     VIRTIO_RING_F_EVENT_IDX,
     VIRTIO_NET_F_MRG_RXBUF,
     VIRTIO_F_VERSION_1,
     VIRTIO_NET_F_MTU,
     VIRTIO_F_IOMMU_PLATFORM,
     VIRTIO_F_RING_PACKED,
     VIRTIO_F_RING_RESET,
     VIRTIO_NET_F_HASH_REPORT,
     VHOST_INVALID_FEATURE_BIT
};

As RSC won't be provided by TUN/TAP anyhow.

Adding the RSC bit does not let QEMU disable vhost for RSC, but instead 
it implicitly disables RSC in my understanding. It is still better than 
advertising the availability of that feature while it is missing.

Looking at the code, I also found the case of vhost-vdpa. vhost can be
simply disabled if it is backed by tuntap, but it is not the case for vDPA.

True, technically, vDPA can fallback to SVQ, but it's another topic.

My point of this discussion is that we cannot enable features just 
because they are sufficiently old or because the user claims QEMU runs 
on Linux sufficiently new. eBPF requires privilege, and vDPA requires 
hardware feature. A fallback is not a silver bullet either, and there 
are situations that providing a fallback is not a trivial task.

Regards,
Akihiko Odaki