On Tue, Jul 30, 2024 at 6:23 PM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:
>
> On 2024/07/30 12:45, Jason Wang wrote:
> > On Tue, Jul 30, 2024 at 11:29 AM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:
> >>
> >> On 2024/07/30 12:17, Jason Wang wrote:
> >>> On Tue, Jul 30, 2024 at 11:12 AM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:
> >>>>
> >>>> On 2024/07/30 12:03, Jason Wang wrote:
> >>>>> On Tue, Jul 30, 2024 at 10:57 AM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:
> >>>>>>
> >>>>>> On 2024/07/30 11:04, Jason Wang wrote:
> >>>>>>> On Tue, Jul 30, 2024 at 12:43 AM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>> On 2024/07/29 23:29, Peter Xu wrote:
> >>>>>>>>> On Mon, Jul 29, 2024 at 01:45:12PM +0900, Akihiko Odaki wrote:
> >>>>>>>>>> On 2024/07/29 12:50, Jason Wang wrote:
> >>>>>>>>>>> On Sun, Jul 28, 2024 at 11:19 PM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 2024/07/27 5:47, Peter Xu wrote:
> >>>>>>>>>>>>> On Fri, Jul 26, 2024 at 04:17:12PM +0100, Daniel P. Berrangé wrote:
> >>>>>>>>>>>>>> On Fri, Jul 26, 2024 at 10:43:42AM -0400, Peter Xu wrote:
> >>>>>>>>>>>>>>> On Fri, Jul 26, 2024 at 09:48:02AM +0100, Daniel P. Berrangé wrote:
> >>>>>>>>>>>>>>>> On Fri, Jul 26, 2024 at 09:03:24AM +0200, Thomas Huth wrote:
> >>>>>>>>>>>>>>>>> On 26/07/2024 08.08, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>>>>>>> On Thu, Jul 25, 2024 at 06:18:20PM -0400, Peter Xu wrote:
> >>>>>>>>>>>>>>>>>>> On Tue, Aug 01, 2023 at 01:31:48AM +0300, Yuri Benditovich wrote:
> >>>>>>>>>>>>>>>>>>>> USO features of virtio-net device depend on kernel ability
> >>>>>>>>>>>>>>>>>>>> to support them, for backward compatibility by default the
> >>>>>>>>>>>>>>>>>>>> features are disabled on 8.0 and earlier.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Signed-off-by: Yuri Benditovich <yuri.benditovich@xxxxxxxxxx>
> >>>>>>>>>>>>>>>>>>>> Signed-off-by: Andrew Melnychecnko <andrew@xxxxxxxxxx>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Looks like this patch broke migration when the VM starts on a host that has
> >>>>>>>>>>>>>>>>>>> USO support, to another host that doesn't..
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> This was always the case with all offloads. The answer at the moment is,
> >>>>>>>>>>>>>>>>>> don't do this.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> May I ask for my understanding:
> >>>>>>>>>>>>>>>>> "don't do this" = don't automatically enable/disable virtio features in QEMU
> >>>>>>>>>>>>>>>>> depending on host kernel features, or "don't do this" = don't try to migrate
> >>>>>>>>>>>>>>>>> between machines that have different host kernel features?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Long term, we need to start exposing management APIs
> >>>>>>>>>>>>>>>>>> to discover this, and management has to disable unsupported features.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Ack, this likely needs some treatments from the libvirt side, too.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> When QEMU automatically toggles machine type features based on the host
> >>>>>>>>>>>>>>>> kernel, relying on libvirt to then disable them again is impractical,
> >>>>>>>>>>>>>>>> as we cannot assume that the libvirt version people are using knows
> >>>>>>>>>>>>>>>> about newly introduced features. Even if libvirt is updated to know
> >>>>>>>>>>>>>>>> about it, people can easily be using a previous libvirt release.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> QEMU itself needs to make the machine types do what they are there
> >>>>>>>>>>>>>>>> to do, which is to define a stable machine ABI.
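> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> (To illustrate the scale of the problem: with the properties this
> >>>>>>>>>>>>>>>> series added, a mgmt app wanting a stable guest ABI would have to
> >>>>>>>>>>>>>>>> know to pin each knob itself, something like
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>   -device virtio-net-pci,guest_uso4=off,guest_uso6=off,host_uso=off,...
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> and again for every future host-dependent default.)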
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> What QEMU is missing here is a "platform ABI" concept, to encode
> >>>>>>>>>>>>>>>> sets of features which are tied to specific platform generations.
> >>>>>>>>>>>>>>>> As long as we don't have that we'll keep having these broken
> >>>>>>>>>>>>>>>> migration problems from machine types dynamically changing instead
> >>>>>>>>>>>>>>>> of providing a stable guest ABI.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Any more elaboration on this idea? Would it be easily feasible in
> >>>>>>>>>>>>>>> implementation?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In terms of launching QEMU I'd imagine:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>   $QEMU -machine pc-q35-9.1 -platform linux-6.9 ...args...
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Any virtual machine HW features which are tied to host kernel features
> >>>>>>>>>>>>>> would have their defaults set based on the requested -platform. The
> >>>>>>>>>>>>>> -machine will be fully invariant wrt the host kernel.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> You would have -platform help to list available platforms, and a
> >>>>>>>>>>>>>> corresponding QMP "query-platforms" command to list what platforms
> >>>>>>>>>>>>>> are supported on a given host OS.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Downstream distros can provide their own platform definitions
> >>>>>>>>>>>>>> (eg "linux-rhel-9.5") if they have kernels whose feature set
> >>>>>>>>>>>>>> diverges from upstream due to backports.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Mgmt apps won't need to be taught about every single little QEMU
> >>>>>>>>>>>>>> setting whose default is derived from the kernel. Individual
> >>>>>>>>>>>>>> defaults are opaque and controlled by the requested platform.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Live migration has clearly defined semantics, and a mgmt app can
> >>>>>>>>>>>>>> use query-platforms to validate that two hosts are compatible.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Omitting -platform should pick the very latest platform that is
> >>>>>>>>>>>>>> compatible with the current host (not necessarily the latest
> >>>>>>>>>>>>>> platform built in to QEMU).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This seems to add one more layer to maintain, and so far I don't know
> >>>>>>>>>>>>> whether it's a must.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> To put it simply, can we just rely on the qemu cmdline as "the guest
> >>>>>>>>>>>>> ABI"? I thought it was mostly the case already, except some extremely
> >>>>>>>>>>>>> rare outliers.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> When we have one host that boots up a VM using:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>   $QEMU1 $cmdline
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Then another host boots up:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>   $QEMU2 $cmdline -incoming XXX
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Then migration should succeed if $cmdline is exactly the same, and the VM
> >>>>>>>>>>>>> can boot up all fine without errors on both sides.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> AFAICT this has nothing to do with what kernel is underneath, not even
> >>>>>>>>>>>>> Linux? I think either QEMU1 or QEMU2 has the option to fail. But if
> >>>>>>>>>>>>> neither did, I thought the ABI should be guaranteed.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> That's why I think this is a migration violation, as 99.99% of other device
> >>>>>>>>>>>>> properties should be following this rule. The issue here is, we have the
> >>>>>>>>>>>>> same virtio-net-pci cmdline on both sides in this case, but the ABI got
> >>>>>>>>>>>>> broken.
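> >>>>>>>>>>>>>
> >>>>>>>>>>>>> (To illustrate with this USO case; the netdev names and exact error
> >>>>>>>>>>>>> wording below are illustrative only:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>   # src host, tuntap supports USO
> >>>>>>>>>>>>>   $QEMU1 -netdev tap,id=n0 -device virtio-net-pci,netdev=n0
> >>>>>>>>>>>>>   # guest negotiates VIRTIO_NET_F_GUEST_USO4/GUEST_USO6/HOST_USO
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>   # dst host, older kernel, identical cmdline
> >>>>>>>>>>>>>   $QEMU2 -netdev tap,id=n0 -device virtio-net-pci,netdev=n0 -incoming XXX
> >>>>>>>>>>>>>   # the dst device cannot accept the USO bits the guest already
> >>>>>>>>>>>>>   # acked, so the device load fails with something like
> >>>>>>>>>>>>>   # "Features 0x... unsupported" and migration aborts
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Same cmdline, different guest ABI, and the failure only shows up at
> >>>>>>>>>>>>> migration time.)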
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> That's also why I was suggesting that if the property contributes to the
> >>>>>>>>>>>>> guest ABI, then AFAIU QEMU needs to:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>   - Firstly, never quietly flip any bit that affects the ABI...
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>   - Have a default value of off, then QEMU will always allow the VM to boot
> >>>>>>>>>>>>>     by default, while advanced users can opt in to new features. We can't
> >>>>>>>>>>>>>     make this ON by default otherwise some VMs can already fail to boot,
> >>>>>>>>>>>>
> >>>>>>>>>>>> It may not necessarily be the case that old features are supported by
> >>>>>>>>>>>> every system. In an extreme case, a user may migrate a VM from Linux to
> >>>>>>>>>>>> Windows, which probably doesn't support any offloading at all. A more
> >>>>>>>>>>>> convincing scenario is RSS offloading with eBPF; using eBPF requires
> >>>>>>>>>>>> privileges, so we cannot assume it is always available even on the latest
> >>>>>>>>>>>> version of Linux.
> >>>>>>>>>>>
> >>>>>>>>>>> I don't get why eBPF matters here. It is something that is not noticed
> >>>>>>>>>>> by the guest and we have a fallback anyhow.
> >>>>>>>>
> >>>>>>>> It is noticeable by the guest, and the fallback is not effective with
> >>>>>>>> vhost.
> >>>>>>>
> >>>>>>> It's a bug then. QEMU can fall back to tuntap if it sees issues in vhost.
> >>>>>>
> >>>>>> We can certainly fall back to in-QEMU RSS by disabling vhost, but I would
> >>>>>> not say lack of such a fallback is a bug.
> >>>>>
> >>>>> Such a fallback is by design since the introduction of vhost.
> >>>>>
> >>>>>> We don't provide an in-QEMU
> >>>>>> fallback for other offloads.
> >>>>>
> >>>>> Yes, but what I want to say is that eBPF RSS is different from those
> >>>>> segmentation offloads. And technically, QEMU can do fallbacks for
> >>>>> offloads (as RSC did).
> >>>>
> >>>> Well, I couldn't find any code disabling vhost for the in-QEMU RSC
> >>>> implementation.
> >>>
> >>> It should be a bug (and I remember we disabled vhost when the patches
> >>> were merged). Have you tested it in a guest to see if it can see RSC
> >>> when vhost is enabled?
> >>>
> >>> I suspect we need to add the RSC bit into the current kernel_feature_bits:
> >>>
> >>> /* Features supported by host kernel. */
> >>> static const int kernel_feature_bits[] = {
> >>>     VIRTIO_F_NOTIFY_ON_EMPTY,
> >>>     VIRTIO_RING_F_INDIRECT_DESC,
> >>>     VIRTIO_RING_F_EVENT_IDX,
> >>>     VIRTIO_NET_F_MRG_RXBUF,
> >>>     VIRTIO_F_VERSION_1,
> >>>     VIRTIO_NET_F_MTU,
> >>>     VIRTIO_F_IOMMU_PLATFORM,
> >>>     VIRTIO_F_RING_PACKED,
> >>>     VIRTIO_F_RING_RESET,
> >>>     VIRTIO_NET_F_HASH_REPORT,
> >>>     VHOST_INVALID_FEATURE_BIT
> >>> };
> >>>
> >>> As RSC won't be provided by TUN/TAP anyhow.
> >>
> >> Adding the RSC bit does not let QEMU disable vhost for RSC, but instead
> >> it implicitly disables RSC in my understanding.
> >
> > Yes.
> >
> >> It is still better than
> >> advertising the availability of that feature while it is missing.
> >
> > Down the road, we probably need to change the behaviour of disabling vhost-net.
> >
> >>
> >>>
> >>>>
> >>>> Looking at the code, I also found the case of vhost-vdpa. vhost can be
> >>>> simply disabled if it is backed by tuntap, but that is not the case for vDPA.
> >>>
> >>> True, technically, vDPA can fall back to SVQ, but it's another topic.
> >>
> >> My point of this discussion is that we cannot enable features just
> >> because they are sufficiently old or because the user claims QEMU runs
> >> on a sufficiently new Linux.
> >> eBPF requires privileges, and vDPA requires hardware features. A
> >> fallback is not a silver bullet either, and there are situations where
> >> providing a fallback is not a trivial task.
> >
> > To make sure we are on the same page: I just want to point out that
> > eBPF RSS is not a good example in this context.
> >
> > It works only for tuntap, so we should stick to the behaviour of
> > trying to fall back to userspace if we can, as we already have a
> > userspace fallback. This is the fundamental difference with other
> > features (like segmentation offload) or backends (vDPA) that don't
> > have an existing fallback.
>
> Some (probably not all) offloads are implemented in hw/net/net_tx_pkt.c.
> They are not wired up to behave as a fallback when tuntap's vhost is
> enabled, just as the in-QEMU RSS is not. In either case, we need to
> spend some effort on wiring things up.
>
> I'm not sure it is worthwhile. I think there is a high chance that
> selectively disabling vhost and keeping RSS enabled with a fallback will
> result in worse performance than keeping vhost enabled and disabling
> RSS. Such a fallback can still function as an emergency escape hatch,
> but it is also incomplete as we don't have fallbacks for other features.

The reason is that we depend on ioctls to configure and negotiate with
tuntap correctly.

> I would rather make any features missing in the vhost backend fail to
> keep things consistent.

You might be right, but it's too late to do that.

Thanks

>
> Regards,
> Akihiko Odaki
>
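
PS: to make the kernel_feature_bits point above concrete, I mean something
like this (an untested sketch; assuming VIRTIO_NET_F_RSC_EXT is the bit
the RSC code negotiates):

/* Features supported by host kernel. */
static const int kernel_feature_bits[] = {
    VIRTIO_F_NOTIFY_ON_EMPTY,
    VIRTIO_RING_F_INDIRECT_DESC,
    VIRTIO_RING_F_EVENT_IDX,
    VIRTIO_NET_F_MRG_RXBUF,
    VIRTIO_F_VERSION_1,
    VIRTIO_NET_F_MTU,
    VIRTIO_F_IOMMU_PLATFORM,
    VIRTIO_F_RING_PACKED,
    VIRTIO_F_RING_RESET,
    VIRTIO_NET_F_HASH_REPORT,
    /*
     * TUN/TAP never offers RSC, so listing the bit here makes vhost
     * feature negotiation clear it rather than advertising a feature
     * the vhost path cannot provide, i.e. RSC is implicitly disabled
     * whenever vhost is enabled.
     */
    VIRTIO_NET_F_RSC_EXT,
    VHOST_INVALID_FEATURE_BIT
};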