On Tue, Jul 30, 2024 at 11:29 AM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote: > > On 2024/07/30 12:17, Jason Wang wrote: > > On Tue, Jul 30, 2024 at 11:12 AM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote: > >> > >> On 2024/07/30 12:03, Jason Wang wrote: > >>> On Tue, Jul 30, 2024 at 10:57 AM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote: > >>>> > >>>> On 2024/07/30 11:04, Jason Wang wrote: > >>>>> On Tue, Jul 30, 2024 at 12:43 AM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote: > >>>>>> > >>>>>> On 2024/07/29 23:29, Peter Xu wrote: > >>>>>>> On Mon, Jul 29, 2024 at 01:45:12PM +0900, Akihiko Odaki wrote: > >>>>>>>> On 2024/07/29 12:50, Jason Wang wrote: > >>>>>>>>> On Sun, Jul 28, 2024 at 11:19 PM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote: > >>>>>>>>>> > >>>>>>>>>> On 2024/07/27 5:47, Peter Xu wrote: > >>>>>>>>>>> On Fri, Jul 26, 2024 at 04:17:12PM +0100, Daniel P. Berrangé wrote: > >>>>>>>>>>>> On Fri, Jul 26, 2024 at 10:43:42AM -0400, Peter Xu wrote: > >>>>>>>>>>>>> On Fri, Jul 26, 2024 at 09:48:02AM +0100, Daniel P. Berrangé wrote: > >>>>>>>>>>>>>> On Fri, Jul 26, 2024 at 09:03:24AM +0200, Thomas Huth wrote: > >>>>>>>>>>>>>>> On 26/07/2024 08.08, Michael S. Tsirkin wrote: > >>>>>>>>>>>>>>>> On Thu, Jul 25, 2024 at 06:18:20PM -0400, Peter Xu wrote: > >>>>>>>>>>>>>>>>> On Tue, Aug 01, 2023 at 01:31:48AM +0300, Yuri Benditovich wrote: > >>>>>>>>>>>>>>>>>> USO features of virtio-net device depend on kernel ability > >>>>>>>>>>>>>>>>>> to support them, for backward compatibility by default the > >>>>>>>>>>>>>>>>>> features are disabled on 8.0 and earlier. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Signed-off-by: Yuri Benditovich <yuri.benditovich@xxxxxxxxxx> > >>>>>>>>>>>>>>>>>> Signed-off-by: Andrew Melnychecnko <andrew@xxxxxxxxxx> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Looks like this patch broke migration when the VM starts on a host that has > >>>>>>>>>>>>>>>>> USO supported, to another host that doesn't.. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> This was always the case with all offloads. The answer at the moment is, > >>>>>>>>>>>>>>>> don't do this. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> May I ask for my understanding: > >>>>>>>>>>>>>>> "don't do this" = don't automatically enable/disable virtio features in QEMU > >>>>>>>>>>>>>>> depending on host kernel features, or "don't do this" = don't try to migrate > >>>>>>>>>>>>>>> between machines that have different host kernel features? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Long term, we need to start exposing management APIs > >>>>>>>>>>>>>>>> to discover this, and management has to disable unsupported features. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Ack, this likely needs some treatments from the libvirt side, too. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> When QEMU automatically toggles machine type featuers based on host > >>>>>>>>>>>>>> kernel, relying on libvirt to then disable them again is impractical, > >>>>>>>>>>>>>> as we cannot assume that the libvirt people are using knows about > >>>>>>>>>>>>>> newly introduced features. Even if libvirt is updated to know about > >>>>>>>>>>>>>> it, people can easily be using a previous libvirt release. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> QEMU itself needs to make the machine types do that they are there > >>>>>>>>>>>>>> todo, which is to define a stable machine ABI. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> What QEMU is missing here is a "platform ABI" concept, to encode > >>>>>>>>>>>>>> sets of features which are tied to specific platform generations. > >>>>>>>>>>>>>> As long as we don't have that we'll keep having these broken > >>>>>>>>>>>>>> migration problems from machine types dynamically changing instead > >>>>>>>>>>>>>> of providing a stable guest ABI. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Any more elaboration on this idea? Would it be easily feasible in > >>>>>>>>>>>>> implementation? > >>>>>>>>>>>> > >>>>>>>>>>>> In terms of launching QEMU I'd imagine: > >>>>>>>>>>>> > >>>>>>>>>>>> $QEMU -machine pc-q35-9.1 -platform linux-6.9 ...args... > >>>>>>>>>>>> > >>>>>>>>>>>> Any virtual machine HW features which are tied to host kernel features > >>>>>>>>>>>> would have their defaults set based on the requested -platform. The > >>>>>>>>>>>> -machine will be fully invariant wrt the host kernel. > >>>>>>>>>>>> > >>>>>>>>>>>> You would have -platform hlep to list available platforms, and > >>>>>>>>>>>> corresonding QMP "query-platforms" command to list what platforms > >>>>>>>>>>>> are supported on a given host OS. > >>>>>>>>>>>> > >>>>>>>>>>>> Downstream distros can provide their own platforms definitions > >>>>>>>>>>>> (eg "linux-rhel-9.5") if they have kernels whose feature set > >>>>>>>>>>>> diverges from upstream due to backports. > >>>>>>>>>>>> > >>>>>>>>>>>> Mgmt apps won't need to be taught about every single little QEMU > >>>>>>>>>>>> setting whose default is derived from the kernel. Individual > >>>>>>>>>>>> defaults are opaque and controlled by the requested platform. > >>>>>>>>>>>> > >>>>>>>>>>>> Live migration has clearly defined semantics, and mgmt app can > >>>>>>>>>>>> use query-platforms to validate two hosts are compatible. > >>>>>>>>>>>> > >>>>>>>>>>>> Omitting -platform should pick the very latest platform that is > >>>>>>>>>>>> cmpatible with the current host (not neccessarily the latest > >>>>>>>>>>>> platform built-in to QEMU). > >>>>>>>>>>> > >>>>>>>>>>> This seems to add one more layer to maintain, and so far I don't know > >>>>>>>>>>> whether it's a must. > >>>>>>>>>>> > >>>>>>>>>>> To put it simple, can we simply rely on qemu cmdline as "the guest ABI"? I > >>>>>>>>>>> thought it was mostly the case already, except some extremely rare > >>>>>>>>>>> outliers. > >>>>>>>>>>> > >>>>>>>>>>> When we have one host that boots up a VM using: > >>>>>>>>>>> > >>>>>>>>>>> $QEMU1 $cmdline > >>>>>>>>>>> > >>>>>>>>>>> Then another host boots up: > >>>>>>>>>>> > >>>>>>>>>>> $QEMU2 $cmdline -incoming XXX > >>>>>>>>>>> > >>>>>>>>>>> Then migration should succeed if $cmdline is exactly the same, and the VM > >>>>>>>>>>> can boot up all fine without errors on both sides. > >>>>>>>>>>> > >>>>>>>>>>> AFAICT this has nothing to do with what kernel is underneath, even not > >>>>>>>>>>> Linux? I think either QEMU1 / QEMU2 has the option to fail. But if it > >>>>>>>>>>> didn't, I thought the ABI should be guaranteed. > >>>>>>>>>>> > >>>>>>>>>>> That's why I think this is a migration violation, as 99.99% of other device > >>>>>>>>>>> properties should be following this rule. The issue here is, we have the > >>>>>>>>>>> same virtio-net-pci cmdline on both sides in this case, but the ABI got > >>>>>>>>>>> break. > >>>>>>>>>>> > >>>>>>>>>>> That's also why I was suggesting if the property contributes to the guest > >>>>>>>>>>> ABI, then AFAIU QEMU needs to: > >>>>>>>>>>> > >>>>>>>>>>> - Firstly, never quietly flipping any bit that affects the ABI... > >>>>>>>>>>> > >>>>>>>>>>> - Have a default value of off, then QEMU will always allow the VM to boot > >>>>>>>>>>> by default, while advanced users can opt-in on new features. We can't > >>>>>>>>>>> make this ON by default otherwise some VMs can already fail to boot, > >>>>>>>>>> > >>>>>>>>>> It may not be necessary the case that old features are supported by > >>>>>>>>>> every systems. In an extreme case, a user may migrate a VM from Linux to > >>>>>>>>>> Windows, which probably doesn't support any offloading at all. A more > >>>>>>>>>> convincing scenario is RSS offloading with eBPF; using eBPF requires a > >>>>>>>>>> privilege so we cannot assume it is always available even on the latest > >>>>>>>>>> version of Linux. > >>>>>>>>> > >>>>>>>>> I don't get why eBPF matters here. It is something that is not noticed > >>>>>>>>> by the guest and we have a fallback anyhow. > >>>>>> > >>>>>> It is noticeable for the guest, and the fallback is not effective with > >>>>>> vhost. > >>>>> > >>>>> It's a bug then. Qemu can fallback to tuntap if it sees issues in vhost. > >>>> > >>>> We can certainly fallback to in-QEMU RSS by disabling vhost, but I would > >>>> not say lack of such fallback is a bug. > >>> > >>> Such fallback is by design since the introduction of vhost. > >>> > >>>> We don't provide in-QEMU > >>>> fallback for other offloads. > >>> > >>> Yes but what I want to say is that eBPF RSS is different from those > >>> segmentation offloads. And technically, Qemu can do fallback for > >>> offloads (as RSC did). > >> > >> Well, I couldn't find any code disabling vhost for the in-QEMU RSC > >> implementation. > > > > It should be a bug (and I remember we disabled vhost when the patches > > were merged). Have you tested it in a guest to see if it can see RSC > > when vhost is enabled? > > > > I suspect we need to add the RSC bit into current kernel_feature_bits: > > > > /* Features supported by host kernel. */ > > static const int kernel_feature_bits[] = { > > VIRTIO_F_NOTIFY_ON_EMPTY, > > VIRTIO_RING_F_INDIRECT_DESC, > > VIRTIO_RING_F_EVENT_IDX, > > VIRTIO_NET_F_MRG_RXBUF, > > VIRTIO_F_VERSION_1, > > VIRTIO_NET_F_MTU, > > VIRTIO_F_IOMMU_PLATFORM, > > VIRTIO_F_RING_PACKED, > > VIRTIO_F_RING_RESET, > > VIRTIO_NET_F_HASH_REPORT, > > VHOST_INVALID_FEATURE_BIT > > }; > > > > As RSC won't be provided by TUN/TAP anyhow. > > Adding the RSC bit does not let QEMU disable vhost for RSC, but instead > it implicitly disables RSC in my understanding. Yes. > It is still better than > advertising the availability of that feature while it is missing. Down the road, we probably need to change the behaviour of disabling vhost-net. > > > > >> > >> Looking at the code, I also found the case of vhost-vdpa. vhost can be > >> simply disabled if it is backed by tuntap, but it is not the case for vDPA. > > > > True, technically, vDPA can fallback to SVQ, but it's another topic. > > My point of this discussion is that we cannot enable features just > because they are sufficiently old or because the user claims QEMU runs > on Linux sufficiently new. eBPF requires privilege, and vDPA requires > hardware feature. A fallback is not a silver bullet either, and there > are situations that providing a fallback is not a trivial task. To make sure we are on the same page. I just want to point out that eBPF RSS is not a good example in this context. It works only for tuntap, so we should stick to the behaviour of trying to fallback to userspace if we can as we've already had a userspace fallback. This is the fundamental difference with other features (like segmentation offload) or backend (vDPA) that doesn't have an existing fallback. Thanks > > Regards, > Akihiko Odaki >