On 10/14/2021 5:01 PM, Paolo Bonzini wrote: > On 14/10/21 10:02, Liu, Jing2 wrote: > >> In principle I don't like it very much; it would be nicer to say "you > >> enable it for QEMU itself via arch_prctl(ARCH_SET_STATE_ENABLE), and > >> for the guests via ioctl(KVM_SET_CPUID2)". But I can see why you > >> want to keep things simple, so it's not a strong objection at all. > > > > Does this mean that KVM allocate 3 buffers via > > 1) Qemu's request, instead of via 2) guest XCR0 trap? > > Based on the input from Andy and Thomas, the new way would be like this: > > 1) host_fpu must always be checked for reallocation in kvm_load_guest_fpu > (or in the FPU functions that it calls, that depends on the rest of Thomas's > patches). That's because arch_prctl can enable AMX for QEMU at any point > after KVM_CREATE_VCPU. For Qemu's XFD, I'd like to confirm that: Since the arch_prctl() onlys add current->group_leader->thread.fpu's state_perm, __state_size, (current->thread.fpu.* is not changed), thus in kvm_load_guest_fpu, host_fpu->xfd is always 1. That is to say, Qemu's arch_prctl() doesn't change any copies of XFD. > > 2) every use of vcpu->arch.guest_supported_xcr0 is changed to only include > those dynamic-feature bits that were enabled via arch_prctl. > That is, something like: > > static u64 kvm_guest_supported_cr0(struct kvm_vcpu *vcpu) { > return vcpu->arch.guest_supported_xcr0 & > (~xfeatures_mask_user_dynamic | \ > current->thread.fpu.dynamic_state_perm); > } > > 3) Even with passthrough disabled, the guest can run with XFD set to > vcpu->arch.guest_xfd (and likewise for XFD_ERR) which is much simpler > than trapping #NM. The traps for writing XCR0 and XFD are used to allocate > dynamic state for guest_fpu, and start the passthrough of XFD and XFD_ERR. > What we need is: > > - if a dynamic state has XCR0[n]=0, bit n will never be set in XFD_ERR and the > state will never be dirtied by the guest. > > - if a dynamic state has XCR0[n]=1, but all enabled dynamic states have > XFD[n]=1, the guest is not able to dirty any dynamic XSAVE state, because > they all have either XCR0[n]=0 or XFD[n]=1. An attempt to do so will cause an > #NM trap and set the bit in XFD_ERR. > > - if a dynamic state has XCR0[n]=1 and XFD[n]=0, the state for bit n is > allocated in guest_fpu, and it can also disable the vmexits for XFD and > XFD_ERR. > Got it, the principle is once XCR0[n]=1 and XFD[n]=0, then guest is allowed to use the dynamic XSAVE state, thus KVM must prepare all things well before. This probably happens shortly after guest #NM. Only one thing: it seems we assume that vcpu->arch.xfd is guest runtime value. And before guest initializes XFD, KVM provides vcpu->arch.xfd[18]=1, right? But the spec asks XFD reset value as zero. If so, between guest init XCR0 to 1 and init XFD to 1, it's XCR0[n]=1 and XFD[n]=0. If a guest never init XFD and directly use dynamic state... Or do we want to provide guest a XFD[18]=1 value at the very beginning? > Therefore: > > - if passthrough is disabled, the XCR0 and XFD write traps can check > guest_xcr0 & ~guest_xfd. If it includes a dynamic state bit, dynamic state is > allocated for all bits enabled in guest_xcr0 and passthrough is started; this > should happen shortly after the guest gets its first #NM trap for AMX. > > - if passthrough is enabled, the XCR0 write trap must still ensure that > dynamic state is allocated for all bits enabled in guest_xcr0. > > So something like this pseudocode is called by both XCR0 and XFD writes: > > int kvm_alloc_fpu_dynamic_features(struct kvm_vcpu *vcpu) { > u64 allowed_dynamic = current->thread.fpu.dynamic_state_perm; > u64 enabled_dynamic = > vcpu->arch.xcr0 & xfeatures_mask_user_dynamic; > > /* All dynamic features have to be arch_prctl'd first. */ > WARN_ON_ONCE(enabled_dynamic & ~allowed_dynamic); > > if (!vcpu->arch.xfd_passthrough) { > /* All dynamic states will #NM? Wait and see. */ > if ((enabled_dynamic & ~vcpu->arch.xfd) == 0) Here, when guest init XCR0 to 1, vcpu->arch.xfd should be 1 otherwise XCR0 trap makes passthrough and allocates buffer, which is not what we want. > return 0; > > kvm_x86_ops.enable_xfd_passthrough(vcpu); > } > > /* current->thread.fpu was already handled by arch_prctl. */ It seems so far, arch_prctl does not change current->thread.fpu, only #NM handler itself does it. We here alloc current too. Thanks, Jing > return fpu_alloc_features(vcpu->guest_fpu, > vcpu->guest_fpu.dynamic_state_perm | enabled_dynamic); } > > Paolo