On Fri, Dec 20, 2024, Adrian Hunter wrote: > On 17/12/24 18:09, Sean Christopherson wrote: > > On Mon, Nov 25, 2024, Adrian Hunter wrote: > > I would rather just use kvm_load_host_xsave_state(), by forcing vcpu->arch.{xcr0,xss} > > to XFAM, with a comment explaining that the TDX module sets XCR0 and XSS prior to > > returning from VP.ENTER. I don't see any justificaton for maintaining a special > > flow for TDX, it's just more work. E.g. TDX is missing the optimization to elide > > WRPKRU if the current value is the same as the host value. > > Not entirely missing since write_pkru() does do that by itself: > > static inline void write_pkru(u32 pkru) > { > if (!cpu_feature_enabled(X86_FEATURE_OSPKE)) > return; > /* > * WRPKRU is relatively expensive compared to RDPKRU. > * Avoid WRPKRU when it would not change the value. > */ > if (pkru != rdpkru()) > wrpkru(pkru); Argh. Well that's a bug. KVM did the right thing, and then the core kernel swizzled things around and got in the way. I'll post this once it's fully tested: From: Sean Christopherson <seanjc@xxxxxxxxxx> Date: Fri, 20 Dec 2024 07:38:39 -0800 Subject: [PATCH] KVM: x86: Avoid double RDPKRU when loading host/guest PKRU Use the raw wrpkru() helper when loading the guest/host's PKRU on switch to/from guest context, as the write_pkru() wrapper incurs an unnecessary rdpkru(). In both paths, KVM is guaranteed to have performed RDPKRU since the last possible write, i.e. KVM has a fresh cache of the current value in hardware. This effectively restores KVM behavior to that of KVM rior to commit c806e88734b9 ("x86/pkeys: Provide *pkru() helpers"), which renamed the raw helper from __write_pkru() => wrpkru(), and turned __write_pkru() into a wrapper. Commit 577ff465f5a6 ("x86/fpu: Only write PKRU if it is different from current") then added the extra RDPKRU to avoid an unnecessary WRPKRU, but completely missed that KVM already optimized away pointless writes. Reported-by: Adrian Hunter <adrian.hunter@xxxxxxxxx> Fixes: 577ff465f5a6 ("x86/fpu: Only write PKRU if it is different from current") Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx> --- arch/x86/kvm/x86.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 4320647bd78a..9d5cece9260b 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1186,7 +1186,7 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu) vcpu->arch.pkru != vcpu->arch.host_pkru && ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) || kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) - write_pkru(vcpu->arch.pkru); + wrpkru(vcpu->arch.pkru); } EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state); @@ -1200,7 +1200,7 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu) kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) { vcpu->arch.pkru = rdpkru(); if (vcpu->arch.pkru != vcpu->arch.host_pkru) - write_pkru(vcpu->arch.host_pkru); + wrpkru(vcpu->arch.host_pkru); } if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) { base-commit: 13e98294d7cec978e31138d16824f50556a62d17 -- > } > > For TDX, we don't really need rdpkru() since the TDX Module > clears it, so it could be: > > if (pkru) > wrpkru(pkru); Ah, right, there's no need to do RDPKRU because KVM knows it's zero. On top of my previous suggestion: diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c index e23cd8231144..9e490fccf073 100644 --- a/arch/x86/kvm/vmx/tdx.c +++ b/arch/x86/kvm/vmx/tdx.c @@ -664,11 +664,12 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu) /* * On return from VP.ENTER, the TDX Module sets XCR0 and XSS to the - * maximal values supported by the guest, so from KVM's perspective, - * those are the guest's values at all times. + * maximal values supported by the guest, and zeroes PKRU, so from + * KVM's perspective, those are the guest's values at all times. */ vcpu->arch.ia32_xss = (kvm_tdx->xfam & XFEATURE_SUPERVISOR_MASK); vcpu->arch.xcr0 = (kvm_tdx->xfam & XFEATURE_USE_MASK); + vcpu->arch.pkru = 0; return 0; } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index d380837433c6..d2ea7db896ba 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1208,7 +1208,8 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu) if (cpu_feature_enabled(X86_FEATURE_PKU) && ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) || kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) { - vcpu->arch.pkru = rdpkru(); + if (!vcpu->arch.guest_state_protected) + vcpu->arch.pkru = rdpkru(); if (vcpu->arch.pkru != vcpu->arch.host_pkru) write_pkru(vcpu->arch.host_pkru); } > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c > > index 7eff717c9d0d..b49dcf32206b 100644 > > --- a/arch/x86/kvm/vmx/tdx.c > > +++ b/arch/x86/kvm/vmx/tdx.c > > @@ -636,6 +636,9 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu) > > vcpu->arch.cr0_guest_owned_bits = -1ul; > > vcpu->arch.cr4_guest_owned_bits = -1ul; > > > > + vcpu->arch.cr4 = <maximal value>; > > Sorry for slow reply. Seems fine except maybe CR4 usage. > > TDX Module validates CR4 based on XFAM and scrubs host state > based on XFAM. It seems like we would need to use XFAM to > manufacture a CR4 that we then effectively use as a proxy > instead of just checking XFAM. Yep. > Since only some vcpu->arch.cr4 bits will be meaningful, it also > still leaves the possibility for confusion. IMO, it's less confusing having '0' for CR0 and CR4, while having accurate values for other state. And I'm far more worried about KVM wondering into a bad path because CR0 and/or CR4 are completely wrong. E.g. kvm_mmu.cpu_role would be completely wrong at steady state, the CR4-based runtime CPUID updates would do the wrong thing, and any helper that wraps kvm_is_cr{0,4}_bit_set() would likely do the worst possible thing. > Are you sure you want this? Yeah, pretty sure. It would be nice if the TDX Module exposed guest CR0/CR4 to KVM, a la the traps SEV-ES+ uses, but I think the next best thing is to assume the guest is using all features. > > + vcpu->arch.cr0 = <maximal value, give or take>; > > AFAICT we don't need to care about CR0 Probably not, but having e.g. CR4.PAE/LA57=1 with CR0.PG/PE=0 will be quite weird.