Hi Sean,

Apologies for the delayed reply.

On 08-May-24 12:37 AM, Sean Christopherson wrote:
> On Mon, May 06, 2024, Ravi Bangoria wrote:
>> On 03-May-24 5:21 AM, Sean Christopherson wrote:
>>> On Tue, Apr 16, 2024, Ravi Bangoria wrote:
>>>> Currently, LBR Virtualization is dynamically enabled and disabled for
>>>> a vcpu by intercepting writes to MSR_IA32_DEBUGCTLMSR. This helps by
>>>> avoiding unnecessary save/restore of LBR MSRs when nobody is using them
>>>> in the guest. However, an SEV-ES guest mandates LBR Virtualization to be
>>>> _always_ ON[1], so this dynamic toggling doesn't work for SEV-ES
>>>> guests; in fact, it results in a fatal error:
>>>>
>>>> SEV-ES guest on Zen3, kvm-amd.ko loaded with lbrv=1
>>>>
>>>>   [guest ~]# wrmsr 0x1d9 0x4
>>>>   KVM: entry failed, hardware error 0xffffffff
>>>>   EAX=00000004 EBX=00000000 ECX=000001d9 EDX=00000000
>>>>   ...
>>>>
>>>> Fix this by never intercepting MSR_IA32_DEBUGCTLMSR for SEV-ES guests.
>>>
>>> Uh, what?  I mean, sure, it works, maybe, I dunno.  But there's a _massive_
>>> disconnect between the first paragraph and this statement.
>>>
>>> Oh, good gravy, it "works" because SEV already forces LBR virtualization.
>>>
>>> 	svm->vmcb->control.virt_ext |= LBR_CTL_ENABLE_MASK;
>>>
>>> (a) the changelog needs to call that out.
>>
>> Sorry, I should have called that out explicitly.
>>
>>> (b) KVM needs to disallow SEV-ES if LBR virtualization is disabled by the
>>> admin, i.e. if lbrv=false.
>>
>> That's what I initially thought. But since KVM currently allows booting
>> SEV-ES guests even when lbrv=0 (by silently ignoring the lbrv value),
>> erroring out would be a behavior change.
>
> IMO, that's totally fine. There are no hard guarantees regarding module params.

Sure. I will prepare a patch to remove the lbrv module parameter.

>>> Alternatively, I would be a-ok simply deleting lbrv, e.g. to avoid yet more
>>> printks about why SEV-ES couldn't be enabled.
>>>
>>> Hmm, I'd probably be more than ok.  Because AMD (thankfully, blessedly) uses
>>> CPUID bits for SVM features, the admin can disable LBRV via clear_cpuid (or
>>> whatever it's called now).  And there are hardly any checks on the feature,
>>> so it's not like having a boolean saves anything.  AMD is clearly committed
>>> to making sure LBRV works, so the odds of KVM really getting much value out
>>> of a module param are low.
>>
>> Currently, lbrv is not enabled by default with model-specific -cpu profiles
>> in qemu. So I guess this is not backward compatible?
>
> I am talking about LBRV being disabled in the _host_ kernel, not guest CPUID.
> QEMU enabling LBRV only affects nested SVM, which is out of scope for SEV-ES.

Got it.

>>> And then when you delete lbrv, please add a WARN_ON_ONCE() sanity check in
>>> sev_hardware_setup() (if SEV-ES is supported), because like the DECODEASSISTS
>>> and FLUSHBYASID requirements, it's not super obvious that LBRV is a hard
>>> requirement for SEV-ES (that's an understatement; I'm curious how someone
>>> decided that LBR virtualization is where the line got drawn for "yeah,
>>> _this_ is mandatory").
>>
>> I'm not sure. Some ES internal dependency.
>>
>> In any case, the patch simply fixes the missed clearing of MSR interception
>> for SEV-ES guests. So, would it be okay to apply this patch as is and do the
>> lbrv cleanup as a followup series?
>
> No.
>
> (a) the lbrv module param mess needs to be sorted out.
> (b) this is not a complete fix.
> (c) I'm not convinced it's the right way to fix this, at all.
> (d) there's a big gaping hole in KVM's handling of MSRs that are passed
>     through to SEV-ES guests.
> (e) it's not clear to me that KVM needs to dynamically toggle LBRV for _any_
>     guest.
> (f) I don't like that sev_es_init_vmcb() mucks with the LBRV intercepts
>     without using svm_enable_lbrv().
>
> Unless I'm missing something, KVM allows userspace to get/set MSRs for SEV-ES
> guests, even after the VMSA is encrypted.  E.g. a naive userspace could
> attempt to migrate MSR_IA32_DEBUGCTLMSR and end up unintentionally disabling
> LBRV on the target.  The proper fix for the VMSA being encrypted is likely to
> disallow KVM_{G,S}ET_MSR on MSRs that are context switched via the VMSA.
>
> But that doesn't address the issue where KVM will disable LBRV if userspace
> sets MSR_IA32_DEBUGCTLMSR before the VMSA is encrypted.  The easiest fix for
> that is to have svm_disable_lbrv() do nothing for SEV-ES guests, but I'm not
> convinced that's the best fix.

Agreed. Both 1) KVM_GET/SET_MSR for an SEV-ES guest after the VMSA is
encrypted and 2) the window between setting LBRV and VMSA encryption are valid
issues. I've prepared a draft patch, attached at the end; could you please
review it?

> AFAICT, host perf doesn't use the relevant MSRs, and even if host perf did
> use the MSRs, IIUC there is no "stack", and #VMEXIT retains the guest values
> for non-SEV-ES guests.  I.e. functionally, running with and without LBRV
> would be largely equivalent as far as perf is concerned.  The guest could
> scribble an MSR with garbage, but overall, host perf wouldn't be meaningfully
> affected by LBRV.

FWIW, AMD has multiple versions of LBRs with virt support:

 - Legacy LBR (1 deep, no freeze support on PMI)
 - LBR Stack (16 deep, has freeze support on PMI)

They are independent of each other, and perf uses only the LBR Stack.

> So unless I'm missing something, the only reason to ever disable LBRV would
> be for performance reasons.  Indeed the original commit more or less says as
> much:
>
>   commit 24e09cbf480a72f9c952af4ca77b159503dca44b
>   Author:     Joerg Roedel <joerg.roedel@xxxxxxx>
>   AuthorDate: Wed Feb 13 18:58:47 2008 +0100
>
>       KVM: SVM: enable LBR virtualization
>
>       This patch implements the Last Branch Record Virtualization (LBRV)
>       feature of the AMD Barcelona and Phenom processors into the kvm-amd
>       module. It will only be enabled if the guest enables last branch
>       recording in the DEBUG_CTL MSR. So there is no increased world switch
>       overhead when the guest doesn't use these MSRs.
>
> but what it _doesn't_ say is what the world switch overhead is when LBRV is
> enabled.  If the overhead is small, e.g. 20 cycles?, then I see no reason to
> keep the dynamic toggling.
>
> And if we ditch the dynamic toggling, then this patch is unnecessary to fix
> the LBRV issue.  It _is_ necessary to actually let the guest use the LBRs,
> but that's a wildly different changelog and justification.

The overhead might be less for the legacy LBR, but upcoming hw also supports
LBR Stack Virtualization[1]. The LBR Stack has 34 MSRs in total (two control
and 16*2 stack), and Legacy and Stack LBR virtualization are both controlled
through the same VMCB bit. So I think I still need to keep the dynamic
toggling for LBR Stack virtualization.

[1] AMD64 Architecture Programmer's Manual Pub. 40332, Rev. 4.07 - June 2023,
    Vol 2, 15.23 Last Branch Record Virtualization

> And if we _don't_ ditch the dynamic toggling, then sev_es_init_vmcb() should
> be using svm_enable_lbrv(), not open coding the exact same thing.

Agreed. The patch below covers this change.
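Separately, for the lbrv removal, I'm thinking of a check along these lines in
sev_hardware_setup(). This is an untested sketch: it assumes the existing
sev_es_supported local variable and the X86_FEATURE_LBRV CPUID bit, and the
exact placement relative to the other SEV-ES feature checks is TBD:

	/*
	 * SEV-ES mandates LBR Virtualization, so disallow SEV-ES when the
	 * LBRV CPUID bit is absent. WARN because all SEV-ES capable hardware
	 * is expected to support LBRV.
	 */
	if (sev_es_supported && WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_LBRV)))
		sev_es_supported = false;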
---
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 759581bb2128..7e549ca0a4e9 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -666,6 +666,14 @@ static int __sev_launch_update_vmsa(struct kvm *kvm, struct kvm_vcpu *vcpu,
 		return ret;
 
 	vcpu->arch.guest_state_protected = true;
+
+	/*
+	 * SEV-ES guests mandate LBR Virtualization to be _always_ ON. Enable
+	 * it after setting guest_state_protected because KVM_SET_MSRS allows
+	 * dynamic toggling of LBRV (for performance reasons) on write access
+	 * to MSR_IA32_DEBUGCTLMSR when guest_state_protected is not set.
+	 */
+	svm_enable_lbrv(vcpu);
 	return 0;
 }
 
@@ -3034,7 +3042,6 @@ static void sev_es_init_vmcb(struct vcpu_svm *svm)
 	struct kvm_vcpu *vcpu = &svm->vcpu;
 
 	svm->vmcb->control.nested_ctl |= SVM_NESTED_CTL_SEV_ES_ENABLE;
-	svm->vmcb->control.virt_ext |= LBR_CTL_ENABLE_MASK;
 
 	/*
 	 * An SEV-ES guest requires a VMSA area that is a separate from the
@@ -3086,10 +3093,6 @@ static void sev_es_init_vmcb(struct vcpu_svm *svm)
 	/* Clear intercepts on selected MSRs */
 	set_msr_interception(vcpu, svm->msrpm, MSR_EFER, 1, 1);
 	set_msr_interception(vcpu, svm->msrpm, MSR_IA32_CR_PAT, 1, 1);
-	set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTBRANCHFROMIP, 1, 1);
-	set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTBRANCHTOIP, 1, 1);
-	set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTFROMIP, 1, 1);
-	set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTTOIP, 1, 1);
 }
 
 void sev_init_vmcb(struct vcpu_svm *svm)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9aaf83c8d57d..4a8bd32dfa96 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -99,6 +99,7 @@ static const struct svm_direct_access_msrs {
 	{ .index = MSR_IA32_SPEC_CTRL,			.always = false },
 	{ .index = MSR_IA32_PRED_CMD,			.always = false },
 	{ .index = MSR_IA32_FLUSH_CMD,			.always = false },
+	{ .index = MSR_IA32_DEBUGCTLMSR,		.always = false },
 	{ .index = MSR_IA32_LASTBRANCHFROMIP,		.always = false },
 	{ .index = MSR_IA32_LASTBRANCHTOIP,		.always = false },
 	{ .index = MSR_IA32_LASTINTFROMIP,		.always = false },
@@ -990,7 +991,7 @@ void svm_copy_lbrs(struct vmcb *to_vmcb, struct vmcb *from_vmcb)
 	vmcb_mark_dirty(to_vmcb, VMCB_LBR);
 }
 
-static void svm_enable_lbrv(struct kvm_vcpu *vcpu)
+void svm_enable_lbrv(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 
@@ -1000,6 +1001,9 @@ static void svm_enable_lbrv(struct kvm_vcpu *vcpu)
 	set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTFROMIP, 1, 1);
 	set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTTOIP, 1, 1);
 
+	if (sev_es_guest(vcpu->kvm))
+		set_msr_interception(vcpu, svm->msrpm, MSR_IA32_DEBUGCTLMSR, 1, 1);
+
 	/* Move the LBR msrs to the vmcb02 so that the guest can see them. */
 	if (is_guest_mode(vcpu))
 		svm_copy_lbrs(svm->vmcb, svm->vmcb01.ptr);
 }
 
@@ -1009,6 +1013,8 @@ static void svm_disable_lbrv(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 
+	KVM_BUG_ON(sev_es_guest(vcpu->kvm), vcpu->kvm);
+
 	svm->vmcb->control.virt_ext &= ~LBR_CTL_ENABLE_MASK;
 	set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTBRANCHFROMIP, 0, 0);
 	set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTBRANCHTOIP, 0, 0);
@@ -2821,10 +2827,24 @@ static int svm_get_msr_feature(struct kvm_msr_entry *msr)
 	return 0;
 }
 
+static bool
+sev_es_prevent_msr_access(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+	return sev_es_guest(vcpu->kvm) &&
+	       vcpu->arch.guest_state_protected &&
+	       svm_msrpm_offset(msr_info->index) != MSR_INVALID &&
+	       !msr_write_intercepted(vcpu, msr_info->index);
+}
+
 static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 
+	if (sev_es_prevent_msr_access(vcpu, msr_info)) {
+		msr_info->data = 0;
+		return 0;
+	}
+
 	switch (msr_info->index) {
 	case MSR_AMD64_TSC_RATIO:
 		if (!msr_info->host_initiated &&
@@ -2975,6 +2995,10 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	u32 ecx = msr->index;
 	u64 data = msr->data;
 
+	if (sev_es_prevent_msr_access(vcpu, msr))
+		return 0;
+
 	switch (ecx) {
 	case MSR_AMD64_TSC_RATIO:
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 33878efdebc8..7b2c55dd8242 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -30,7 +30,7 @@
 #define IOPM_SIZE PAGE_SIZE * 3
 #define MSRPM_SIZE PAGE_SIZE * 2
 
-#define MAX_DIRECT_ACCESS_MSRS	47
+#define MAX_DIRECT_ACCESS_MSRS	48
 #define MSRPM_OFFSETS	32
 extern u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly;
 extern bool npt_enabled;
@@ -543,6 +543,7 @@ u32 *svm_vcpu_alloc_msrpm(void);
 void svm_vcpu_init_msrpm(struct kvm_vcpu *vcpu, u32 *msrpm);
 void svm_vcpu_free_msrpm(u32 *msrpm);
 void svm_copy_lbrs(struct vmcb *to_vmcb, struct vmcb *from_vmcb);
+void svm_enable_lbrv(struct kvm_vcpu *vcpu);
 void svm_update_lbrv(struct kvm_vcpu *vcpu);
 
 int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer);
---

This is just a draft patch. As sketched above, I'll also add logic to bail out
of SEV-ES guest creation when LBRV is not supported by the host, remove the
lbrv module parameter, etc.

Thanks,
Ravi