KVM needs to save the guest XFD_ERR value before this register might be accessed by the host and restore it before entering the guest. This implementation saves guest XFD_ERR in two transition points: - When the vCPU thread exits to the userspace VMM; - When the vCPU thread is preempted; XFD_ERR is cleared to ZERO right after saving the previous guest value. Otherwise a stale guest value may confuse the host #NM handler to misinterpret a non-XFD-related #NM as XFD related. There is no need to save the host XFD_ERR value because the only place where XFD_ERR is consumed outside of KVM is in #NM handler (which can not be preempted by a vCPU thread). XFD_ERR should always be observed as ZER0 outside of #NM hanlder, thus clearing XFD_ERR meets the host expectation here. The saved guest value is restored to XFD_ERR right before entering the guest (with preemption disabled). Current implementation still has two opens which we would like to hear suggestions: 1) Will #NM be triggered in host kernel? Now the code is written assuming above is true, and it's the only reason for saving guest XFD_ERR at preemption time. Otherwise the save is only required when the CPU enters ring-3 (either from the vCPU itself or other threads), by leveraging the "user-return notifier" machinery as suggested by Paolo. 2) When to enable XFD_ERR save/restore? There are four options on the table: a) As long as guest cpuid has xfd enabled XFD_ERR save/restore is enabled in every VM-exit (if preemption or ret-to-userspace happens) b) When the guest sets IA32_XFD to 1 for the first time Indicate that guest OS supports XFD features. Because guest OS usually initializes IA32_XFD at boot time, XFD_ERR save/restore is enabled for almost every VM-exit (if preemption or ret-to- userspace happens). No save/restore for legacy guest OS which doesn't support XFD features at all (thus won't touch IA32_XFD). c) When the guest sets IA32_XFD to 0 for the first time Lazily enabling XFD_ERR save/restore until XFD features are used inside guest. However, this option doesn't work because XFD_ERR is set when #NM is raised. An VM-exit could happen between CPU raising #NM and guest #NM handler reading XFD_ERR (before setting XFD to 0). The very first XFD_ERR might be already clobbered by the host due to no save/restore in that small window. d) When the 1st guest #NM with non-zero XFD_ERR occurs Lazily enabling XFD_ERR save/restore until XFD features are used inside guest. This requires intercepting guest #NM until non-zero XFD_ERR occurs. If a guest with XFD in cpuid never launches an AMX application, it implies that #NM is always trapped thus adding a constant overhead which may be even higher than doing RDMSR in preemption path in a) and b): #preempts < #VMEXITS (no #NM trap) < #VMEXITS (#NM trap) The number of preemptions and ret-to-userspaces should be a small portion of total #VMEXITs in a healthy virtualization environment. Our gut-feeling is that adding at most one MSR read and one MSR write to the preempt/user-ret paths is possibly more efficient than increasing #VMEXITs due to trapping #NM. For above analysis we plan to go option b), although this version currently implements a). But we would like to hear other suggestions before making this change. Signed-off-by: Jing Liu <jing2.liu@xxxxxxxxx> Signed-off-by: Kevin Tian <kevin.tian@xxxxxxxxx> Signed-off-by: Yang Zhong <yang.zhong@xxxxxxxxx> --- arch/x86/kernel/fpu/core.c | 2 ++ arch/x86/kvm/cpuid.c | 5 +++++ arch/x86/kvm/vmx/vmx.c | 2 ++ arch/x86/kvm/vmx/vmx.h | 2 +- arch/x86/kvm/x86.c | 5 +++++ 5 files changed, 15 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c index 5089f2e7dc22..9811dc98d550 100644 --- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -238,6 +238,7 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu) fpstate->is_guest = true; gfpu->fpstate = fpstate; + gfpu->xfd_err = XFD_ERR_GUEST_DISABLED; gfpu->user_xfeatures = fpu_user_cfg.default_features; gfpu->user_perm = fpu_user_cfg.default_features; fpu_init_guest_permissions(gfpu); @@ -297,6 +298,7 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest) fpu->fpstate = guest_fps; guest_fps->in_use = true; } else { + fpu_save_guest_xfd_err(guest_fpu); guest_fps->in_use = false; fpu->fpstate = fpu->__task_fpstate; fpu->__task_fpstate = NULL; diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index f3c61205bbf4..ea51b986ee67 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -219,6 +219,11 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu) kvm_apic_set_version(vcpu); } + /* Enable saving guest XFD_ERR */ + best = kvm_find_cpuid_entry(vcpu, 7, 0); + if (best && cpuid_entry_has(best, X86_FEATURE_AMX_TILE)) + vcpu->arch.guest_fpu.xfd_err = 0; + best = kvm_find_cpuid_entry(vcpu, 0xD, 0); if (!best) vcpu->arch.guest_supported_xcr0 = 0; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 6198b13c4846..0db8bdf273e2 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -161,6 +161,7 @@ static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = { MSR_GS_BASE, MSR_KERNEL_GS_BASE, MSR_IA32_XFD, + MSR_IA32_XFD_ERR, #endif MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, @@ -7153,6 +7154,7 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu) static void vmx_update_intercept_xfd(struct kvm_vcpu *vcpu) { vmx_set_intercept_for_msr(vcpu, MSR_IA32_XFD, MSR_TYPE_R, false); + vmx_set_intercept_for_msr(vcpu, MSR_IA32_XFD_ERR, MSR_TYPE_RW, false); } static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu) diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index bf9d3051cd6c..0a00242a91e7 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -340,7 +340,7 @@ struct vcpu_vmx { struct lbr_desc lbr_desc; /* Save desired MSR intercept (read: pass-through) state */ -#define MAX_POSSIBLE_PASSTHROUGH_MSRS 14 +#define MAX_POSSIBLE_PASSTHROUGH_MSRS 15 struct { DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS); DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index d127b229dd29..8b033c9241d6 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4550,6 +4550,9 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) kvm_steal_time_set_preempted(vcpu); srcu_read_unlock(&vcpu->kvm->srcu, idx); + if (vcpu->preempted) + fpu_save_guest_xfd_err(&vcpu->arch.guest_fpu); + static_call(kvm_x86_vcpu_put)(vcpu); vcpu->arch.last_host_tsc = rdtsc(); } @@ -9951,6 +9954,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) if (test_thread_flag(TIF_NEED_FPU_LOAD)) switch_fpu_return(); + fpu_restore_guest_xfd_err(&vcpu->arch.guest_fpu); + if (unlikely(vcpu->arch.switch_db_regs)) { set_debugreg(0, 7); set_debugreg(vcpu->arch.eff_db[0], 0);