Sean Christopherson <seanjc@xxxxxxxxxx> writes: > On Mon, Aug 22, 2022, Vitaly Kuznetsov wrote: >> Sean Christopherson <seanjc@xxxxxxxxxx> writes: >> >> > On Mon, Aug 22, 2022, Vitaly Kuznetsov wrote: >> >> My initial implementation was inventing 'eVMCS revision' concept: >> >> https://lore.kernel.org/kvm/20220629150625.238286-7-vkuznets@xxxxxxxxxx/ >> >> >> >> which is needed if we don't gate all these new fields on CPUID.0x4000000A.EBX BIT(0). >> >> >> >> Going forward, we will still (likely) need something when new fields show up. >> > >> > My comments from that thread still apply. Adding "revisions" or feature flags >> > isn't maintanable, e.g. at best KVM will end up with a ridiculous number of flags. >> > >> > Looking at QEMU, which I strongly suspect is the only VMM that enables >> > KVM_CAP_HYPERV_ENLIGHTENED_VMCS, it does the sane thing of enabling the capability >> > before grabbing the VMX MSRs. >> > >> > So, why not simply apply filtering for host accesses as well? >> >> (I understand that using QEMU to justify KVM's behavior is flawed but...) >> >> QEMU's migration depends on the assumption that identical QEMU's command >> lines create identical (from guest PoV) configurations. Assume we have >> (simplified) >> >> "-cpu CascadeLake-Sever,hv-evmcs" >> >> on both source and destination but source host is newer, i.e. its KVM >> knows about TSC Scaling in eVMCS and destination host has no idea about >> it. If we just apply filtering upon vCPU creation, guest visible MSR >> values are going to be different, right? Ok, assuming QEMU also migrates >> VMX feature MSRs (TODO: check if that's true), we will be able to fail >> mirgration late (which is already much worse than not being able to >> create the desired configuration on destination, 'fail early') if we use >> in-KVM filtering to throw an error to userspace. But if we blindly >> filter control MSRs on the destination, 'TscScaling' will just disapper >> undreneath the guest. This is unlikely to work. > > But all of that holds true irrespetive of eVMCS. If QEMU attempts to migrate a > nested guest from a KVM that supports TSC_SCALING to a KVM that doesn't support > TSC_SCALING, then TSC_SCALING is going to disappear and VM-Entry on the dest will > fail. I.e. QEMU _must_ be able to detect the incompatibility and not attempt > the migration. And with that code in place, QEMU doesn't need to do anything new > for eVMCS, it Just Works. I'm obviously missing something. "-cpu CascadeLake-Sever" presumes cetain features, including VMX features (e.g. TSC_SCALING), an attempt to create such vCPU on a CPU which doesn't support it will lead to immediate failure. So two VMs created on different hosts with -cpu CascadeLake-Sever are guaranteed to look exactly the same from guest PoV. This is not true for '-cpu CascadeLake-Sever,hv-evmcs' (if we do it the way you suggest) as 'hv-evmcs' will be a *different* filter on each host (which is going to depend on KVM version, not even on the host's hardware). > >> In any case, what we need, is an option for VMM (read: QEMU) to create >> the configuration with 'TscScaling' filtered out even KVM supports the >> bit in eVMCS. This way the guest will be able to migrate backwards to an >> older KVM which doesn't support it, i.e. >> >> '-cpu CascadeLake-Sever,hv-evmcs' >> creates the 'origin' eVMCS configuration, no TscScaling >> >> '-cpu CascadeLake-Sever,hv-evmcs,hv-evmcs-2022' creates the updated one. > > Again, this conundrum exists irrespective of eVMCS. Properly solve the problem > for regular nVMX and eVMCS should naturally work. I don't think we have this problem for VMX features as named CPU models in QEMU encode all of them explicitly, they *must* be present whenever such vCPU is created. > >> KVM_CAP_HYPERV_ENLIGHTENED_VMCS is bad as it only takes 'eVMCS' version >> as a parameter (as we assumed it will always change when new fields are >> added, but that turned out to be false). That's why I suggested >> KVM_CAP_HYPERV_ENLIGHTENED_VMCS2. > > Enumerating features via versions is such a bad API though, e.g. if there's a > bug with nested TSC_SCALING, userspace can't disable just nested TSC_SCALING > without everything else under the inscrutable "hv-evmcs-2022" being collateral > damage. Why? Something like "-cpu CascadeLake-Sever,hv-evmcs,hv-evmcs-2022,-vmx-tsc-scaling" should work well, no? 'hv-evmcs*' are just filters, if the VMX feature is not there -- it's not there. We can certainly make this fine-grained and introduce KVM_CAP_HYPERV_EVMCS_TSC_SCALING instead and make a 'hv-*' flag for it, however, with Hyper-V and Windows I really don't like 'frankenstein' Hyper-V configurations which look nothing like genuine Hyper-V as nobody at Microsoft tests new Windows builds with such configurations. And yes, bugs were discovered in the past (e.g. PerfGlobalCtrl and Win11). -- Vitaly