Re: [PATCH v5 03/26] x86/hyperv: Update 'struct hv_enlightened_vmcs' definition

Sean Christopherson <seanjc@xxxxxxxxxx> · Mon, 22 Aug 2022 18:32:23 +0000

On Mon, Aug 22, 2022, Vitaly Kuznetsov wrote:
> Sean Christopherson <seanjc@xxxxxxxxxx> writes:
> 
> > On Mon, Aug 22, 2022, Vitaly Kuznetsov wrote:
> >> My initial implementation was inventing 'eVMCS revision' concept:
> >> https://lore.kernel.org/kvm/20220629150625.238286-7-vkuznets@xxxxxxxxxx/
> >> 
> >> which is needed if we don't gate all these new fields on CPUID.0x4000000A.EBX BIT(0).
> >> 
> >> Going forward, we will still (likely) need something when new fields show up.
> >
> > My comments from that thread still apply.  Adding "revisions" or feature flags
> > isn't maintanable, e.g. at best KVM will end up with a ridiculous number of flags.
> >
> > Looking at QEMU, which I strongly suspect is the only VMM that enables
> > KVM_CAP_HYPERV_ENLIGHTENED_VMCS, it does the sane thing of enabling the capability
> > before grabbing the VMX MSRs.
> >
> > So, why not simply apply filtering for host accesses as well?
> 
> (I understand that using QEMU to justify KVM's behavior is flawed but...)
> 
> QEMU's migration depends on the assumption that identical QEMU's command
> lines create identical (from guest PoV) configurations. Assume we have
> (simplified)
> 
> "-cpu CascadeLake-Sever,hv-evmcs"
> 
> on both source and destination but source host is newer, i.e. its KVM
> knows about TSC Scaling in eVMCS and destination host has no idea about
> it. If we just apply filtering upon vCPU creation, guest visible MSR
> values are going to be different, right? Ok, assuming QEMU also migrates
> VMX feature MSRs (TODO: check if that's true), we will be able to fail
> mirgration late (which is already much worse than not being able to
> create the desired configuration on destination, 'fail early') if we use
> in-KVM filtering to throw an error to userspace. But if we blindly
> filter control MSRs on the destination, 'TscScaling' will just disapper
> undreneath the guest. This is unlikely to work.

But all of that holds true irrespetive of eVMCS.  If QEMU attempts to migrate a
nested guest from a KVM that supports TSC_SCALING to a KVM that doesn't support
TSC_SCALING, then TSC_SCALING is going to disappear and VM-Entry on the dest will
fail.  I.e. QEMU _must_ be able to detect the incompatibility and not attempt
the migration.  And with that code in place, QEMU doesn't need to do anything new
for eVMCS, it Just Works.

> In any case, what we need, is an option for VMM (read: QEMU) to create
> the configuration with 'TscScaling' filtered out even KVM supports the
> bit in eVMCS. This way the guest will be able to migrate backwards to an
> older KVM which doesn't support it, i.e.
> 
> '-cpu CascadeLake-Sever,hv-evmcs'
>  creates the 'origin' eVMCS configuration, no TscScaling
> 
> '-cpu CascadeLake-Sever,hv-evmcs,hv-evmcs-2022' creates the updated one.

Again, this conundrum exists irrespective of eVMCS.  Properly solve the problem
for regular nVMX and eVMCS should naturally work.

> KVM_CAP_HYPERV_ENLIGHTENED_VMCS is bad as it only takes 'eVMCS' version
> as a parameter (as we assumed it will always change when new fields are
> added, but that turned out to be false). That's why I suggested
> KVM_CAP_HYPERV_ENLIGHTENED_VMCS2.

Enumerating features via versions is such a bad API though, e.g. if there's a
bug with nested TSC_SCALING, userspace can't disable just nested TSC_SCALING
without everything else under the inscrutable "hv-evmcs-2022" being collateral
damage.