On Mon, Aug 22, 2022, Vitaly Kuznetsov wrote: > Sean Christopherson <seanjc@xxxxxxxxxx> writes: > > > On Mon, Aug 22, 2022, Vitaly Kuznetsov wrote: > >> My initial implementation was inventing 'eVMCS revision' concept: > >> https://lore.kernel.org/kvm/20220629150625.238286-7-vkuznets@xxxxxxxxxx/ > >> > >> which is needed if we don't gate all these new fields on CPUID.0x4000000A.EBX BIT(0). > >> > >> Going forward, we will still (likely) need something when new fields show up. > > > > My comments from that thread still apply. Adding "revisions" or feature flags > > isn't maintanable, e.g. at best KVM will end up with a ridiculous number of flags. > > > > Looking at QEMU, which I strongly suspect is the only VMM that enables > > KVM_CAP_HYPERV_ENLIGHTENED_VMCS, it does the sane thing of enabling the capability > > before grabbing the VMX MSRs. > > > > So, why not simply apply filtering for host accesses as well? > > (I understand that using QEMU to justify KVM's behavior is flawed but...) > > QEMU's migration depends on the assumption that identical QEMU's command > lines create identical (from guest PoV) configurations. Assume we have > (simplified) > > "-cpu CascadeLake-Sever,hv-evmcs" > > on both source and destination but source host is newer, i.e. its KVM > knows about TSC Scaling in eVMCS and destination host has no idea about > it. If we just apply filtering upon vCPU creation, guest visible MSR > values are going to be different, right? Ok, assuming QEMU also migrates > VMX feature MSRs (TODO: check if that's true), we will be able to fail > mirgration late (which is already much worse than not being able to > create the desired configuration on destination, 'fail early') if we use > in-KVM filtering to throw an error to userspace. But if we blindly > filter control MSRs on the destination, 'TscScaling' will just disapper > undreneath the guest. This is unlikely to work. But all of that holds true irrespetive of eVMCS. If QEMU attempts to migrate a nested guest from a KVM that supports TSC_SCALING to a KVM that doesn't support TSC_SCALING, then TSC_SCALING is going to disappear and VM-Entry on the dest will fail. I.e. QEMU _must_ be able to detect the incompatibility and not attempt the migration. And with that code in place, QEMU doesn't need to do anything new for eVMCS, it Just Works. > In any case, what we need, is an option for VMM (read: QEMU) to create > the configuration with 'TscScaling' filtered out even KVM supports the > bit in eVMCS. This way the guest will be able to migrate backwards to an > older KVM which doesn't support it, i.e. > > '-cpu CascadeLake-Sever,hv-evmcs' > creates the 'origin' eVMCS configuration, no TscScaling > > '-cpu CascadeLake-Sever,hv-evmcs,hv-evmcs-2022' creates the updated one. Again, this conundrum exists irrespective of eVMCS. Properly solve the problem for regular nVMX and eVMCS should naturally work. > KVM_CAP_HYPERV_ENLIGHTENED_VMCS is bad as it only takes 'eVMCS' version > as a parameter (as we assumed it will always change when new fields are > added, but that turned out to be false). That's why I suggested > KVM_CAP_HYPERV_ENLIGHTENED_VMCS2. Enumerating features via versions is such a bad API though, e.g. if there's a bug with nested TSC_SCALING, userspace can't disable just nested TSC_SCALING without everything else under the inscrutable "hv-evmcs-2022" being collateral damage.