On Tue, Sep 26, 2023, Sean Christopherson wrote: > On Mon, Sep 25, 2023, Tyler Stachecki wrote: > > > @@ -1059,7 +1060,8 @@ static void copy_feature(bool from_xstate, struct membuf *to, void *xstate, > > > * It supports partial copy but @to.pos always starts from zero. > > > */ > > > void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate, > > > - u32 pkru_val, enum xstate_copy_mode copy_mode) > > > + u64 xfeatures, u32 pkru_val, > > > + enum xstate_copy_mode copy_mode) > > > { > > > const unsigned int off_mxcsr = offsetof(struct fxregs_state, mxcsr); > > > struct xregs_state *xinit = &init_fpstate.regs.xsave; > > > @@ -1083,7 +1085,7 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate, > > > break; > > > > > > case XSTATE_COPY_XSAVE: > > > - header.xfeatures &= fpstate->user_xfeatures; > > > + header.xfeatures &= fpstate->user_xfeatures & xfeatures; > > > break; > > > > This changes the consideration of the xfeatures copied *into* the uabi buffer > > with respect to the guest xfeatures IIUC (approx guest XCR0, less FP/SSE only). > > > > IOW: we are still trimming guest xfeatures, just at the source...? > > KVM *must* "trim" features when servicing KVM_GET_SAVE{2}, because that's been KVM's > ABI for a very long time, and userspace absolutely relies on that functionality > to ensure that a VM can be migrated within a pool of heterogenous systems so long > as the features that are *exposed* to the guest are supported on all platforms. > > Before the big FPU rework, KVM manually filled xregs_state and directly "trimmed" > based on the guest supported XCR0 (see fill_xsave() below). > > The major difference between my proposed patch and the current code is that KVM > trims *only* at KVM_GET_XSAVE{2}, which in addition to being KVM's historical ABI, > (see load_xsave() below), ensures that the *destination* can load state irrespective > of the possibly-not-yet-defined guest CPUID model. ... > > That being said: the patch that I gave siliently allows unacceptable things to > > be accepted at the destination, whereas this maintains status-quo and signals > > an error when the destination cannot wholly process the uabi buffer (i.e., > > asked to restore more state than the destination processor has present). > > > > The downside of my approach is above -- the flip side, though is that this > > approach requires a patch to be applied on the source. However, we cannot > > apply a patch at the source until it is evacuated of VMs -- chicken and egg > > problem... > > > > Unless I am grossly misunderstanding things here -- forgive me... :-) > > I suspect you're overlooking the impact on the destination. Trimming features > only when saving XSAVE state into the userspace buffer means that the destination > will play nice with the "bad" snapshot. It won't help setups where a VM is being > migrated from a host with more XSAVE features to a host with fewer XSAVE features, > but I don't see a sane way to retroactively fix that situation. And IIUC, that's > not the situation you're in (unless the Skylake host vs. Broadwell guests is only > the test environment). One clarification: this comment from Tyler's proposed patch is somewhat misleading. I do think KVM should filter state only at KVM_GET_XSAVE, because otherwise any off-by-default feature will have different behavior than on-by-default features, e.g. restoring AMX state requires KVM_SET_CPUID, whereas every other features depends only on host capabilities. + /* + * In previous kernels, kvm_arch_vcpu_create() set the guest's fpstate + * based on what the host CPU supported. Recent kernels changed this + * and only accept ustate containing xfeatures that the guest CPU is + * capable of supporting. + */ + ustate->xsave.header.xfeatures &= user_xfeatures; It's only the "realloc" path used when granting access to an off-by-default feature, i.e. AMX, that skips initialize user_xfeatures. __fpstate_reset() does initialize user_xfeatures to the default set for guest state, i.e. the problem is not in kvm_arch_vcpu_create(). The problem is specifically that KVM rejects KVM_SET_XSAVE if userspace limits guest supported xfeatures via KVM_SET_CPUID. In a perfect world, that would be desirable, e.g. KVM's ABI when loading MSRs from userspace is to (mostly) honor guest CPUID, but again this is a clear breakage of userspace. I.e. QEMU does KVM_SET_CPUID2, and *then* KVM_SET_XSAVE, which fails. If QEMU reversed those, KVM_SET_XSAVE would actually succeed. There's another related oddity that will be fixed by my approach, assuming the realloc change is also reverted (I missed that in my pasted patch). Userspace *must* do KVM_SET_CPUID{2} in order to load off-by-default state, whereas there is no such requirement for on-by-default state.