Hi Sean,

Apologies for the delayed reply.

On 08-May-24 12:37 AM, Sean Christopherson wrote:
> On Mon, May 06, 2024, Ravi Bangoria wrote:
>> On 03-May-24 5:21 AM, Sean Christopherson wrote:
>>> On Tue, Apr 16, 2024, Ravi Bangoria wrote:
>>>> Currently, LBR Virtualization is dynamically enabled and disabled for
>>>> a vcpu by intercepting writes to MSR_IA32_DEBUGCTLMSR. This helps by
>>>> avoiding unnecessary save/restore of LBR MSRs when nobody is using them
>>>> in the guest. However, an SEV-ES guest mandates LBR Virtualization to be
>>>> _always_ ON[1], so this dynamic toggling doesn't work for SEV-ES
>>>> guests; in fact, it results in a fatal error:
>>>>
>>>> SEV-ES guest on Zen3, kvm-amd.ko loaded with lbrv=1
>>>>
>>>>   [guest ~]# wrmsr 0x1d9 0x4
>>>>   KVM: entry failed, hardware error 0xffffffff
>>>>   EAX=00000004 EBX=00000000 ECX=000001d9 EDX=00000000
>>>>   ...
>>>>
>>>> Fix this by never intercepting MSR_IA32_DEBUGCTLMSR for SEV-ES guests.
>>>
>>> Uh, what?  I mean, sure, it works, maybe, I dunno.  But there's a _massive_
>>> disconnect between the first paragraph and this statement.
>>>
>>> Oh, good gravy, it "works" because SEV already forces LBR virtualization.
>>>
>>> 	svm->vmcb->control.virt_ext |= LBR_CTL_ENABLE_MASK;
>>>
>>> (a) the changelog needs to call that out.
>>
>> Sorry, I should have called that out explicitly.
>>
>>> (b) KVM needs to disallow SEV-ES if LBR virtualization is disabled by the
>>> admin, i.e. if lbrv=false.
>>
>> That's what I initially thought. But since KVM currently allows booting
>> SEV-ES guests even when lbrv=0 (by silently ignoring the lbrv value),
>> erroring out would be a behavior change.
>
> IMO, that's totally fine. There are no hard guarantees regarding module params.

Sure. I will prepare a patch to remove the lbrv module parameter.

>>> Alternatively, I would be a-ok simply deleting lbrv, e.g. to avoid yet more
>>> printks about why SEV-ES couldn't be enabled.
>>>
>>> Hmm, I'd probably be more than ok.  Because AMD (thankfully, blessedly) uses
>>> CPUID bits for SVM features, the admin can disable LBRV via clear_cpuid (or
>>> whatever it's called now).  And there are hardly any checks on the feature,
>>> so it's not like having a boolean saves anything.  AMD is clearly committed
>>> to making sure LBRV works, so the odds of KVM really getting much value out
>>> of a module param are low.
>>
>> Currently, lbrv is not enabled by default with model-specific -cpu profiles
>> in qemu. So I guess this is not backward compatible?
>
> I am talking about LBRV being disabled in the _host_ kernel, not guest CPUID.
> QEMU enabling LBRV only affects nested SVM, which is out of scope for SEV-ES.

Got it.

>>> And then when you delete lbrv, please add a WARN_ON_ONCE() sanity check in
>>> sev_hardware_setup() (if SEV-ES is supported), because like the DECODEASSISTS
>>> and FLUSHBYASID requirements, it's not super obvious that LBRV is a hard
>>> requirement for SEV-ES (that's an understatement; I'm curious how someone
>>> decided that LBR virtualization is where the line got drawn for "yeah,
>>> _this_ is mandatory").
>>
>> I'm not sure. Some ES internal dependency.
>>
>> In any case, the patch simply fixes the missed clearing of MSR interception
>> for SEV-ES guests. So, would it be okay to apply this patch as is and do the
>> lbrv cleanup as a followup series?
>
> No.
>
> (a) the lbrv module param mess needs to be sorted out.
> (b) this is not a complete fix.
> (c) I'm not convinced it's the right way to fix this, at all.
> (d) there's a big gaping hole in KVM's handling of MSRs that are passed
>     through to SEV-ES guests.
> (e) it's not clear to me that KVM needs to dynamically toggle LBRV for _any_
>     guest.
> (f) I don't like that sev_es_init_vmcb() mucks with the LBRV intercepts
>     without using svm_enable_lbrv().
>
> Unless I'm missing something, KVM allows userspace to get/set MSRs for SEV-ES
> guests, even after the VMSA is encrypted.  E.g. a naive userspace could
> attempt to migrate MSR_IA32_DEBUGCTLMSR and end up unintentionally disabling
> LBRV on the target.  The proper fix for the VMSA being encrypted is likely to
> disallow KVM_{G,S}ET_MSR on MSRs that are context switched via the VMSA.
>
> But that doesn't address the issue where KVM will disable LBRV if userspace
> sets MSR_IA32_DEBUGCTLMSR before the VMSA is encrypted.  The easiest fix for
> that is to have svm_disable_lbrv() do nothing for SEV-ES guests, but I'm not
> convinced that's the best fix.

Agreed. Both 1) KVM_GET/SET_MSR for an SEV-ES guest after the VMSA is
encrypted and 2) the window between setting LBRV and VMSA encryption are valid
issues. I've prepared a draft patch, attached at the end; could you please
review it?

> AFAICT, host perf doesn't use the relevant MSRs, and even if host perf did
> use the MSRs, IIUC there is no "stack", and #VMEXIT retains the guest values
> for non-SEV-ES guests.  I.e. functionally, running with and without LBRV
> would be largely equivalent as far as perf is concerned.  The guest could
> scribble an MSR with garbage, but overall, host perf wouldn't be meaningfully
> affected by LBRV.

FWIW, AMD has multiple versions of LBRs with virt support:

 - Legacy LBR (1 deep, no freeze support on PMI)
 - LBR Stack (16 deep, has freeze support on PMI)

They are independent of each other, and perf uses only the LBR Stack.

> So unless I'm missing something, the only reason to ever disable LBRV would
> be for performance reasons.  Indeed the original commit more or less says as
> much:
>
>   commit 24e09cbf480a72f9c952af4ca77b159503dca44b
>   Author:     Joerg Roedel <joerg.roedel@xxxxxxx>
>   AuthorDate: Wed Feb 13 18:58:47 2008 +0100
>
>       KVM: SVM: enable LBR virtualization
>
>       This patch implements the Last Branch Record Virtualization (LBRV)
>       feature of the AMD Barcelona and Phenom processors into the kvm-amd
>       module. It will only be enabled if the guest enables last branch
>       recording in the DEBUG_CTL MSR. So there is no increased world switch
>       overhead when the guest doesn't use these MSRs.
>
> but what it _doesn't_ say is what the world switch overhead is when LBRV is
> enabled.  If the overhead is small, e.g. 20 cycles?, then I see no reason to
> keep the dynamic toggling.
>
> And if we ditch the dynamic toggling, then this patch is unnecessary to fix
> the LBRV issue.  It _is_ necessary to actually let the guest use the LBRs,
> but that's a wildly different changelog and justification.

The overhead might be less for the legacy LBR, but upcoming hw also supports
LBR Stack Virtualization[1]. The LBR Stack has 34 MSRs in total (two control
and 16*2 stack), and Legacy and Stack LBR virtualization are both controlled
through the same VMCB bit. So I think I still need to keep the dynamic
toggling for LBR Stack virtualization.

[1] AMD64 Architecture Programmer's Manual Pub. 40332, Rev. 4.07 - June 2023,
    Vol 2, 15.23 Last Branch Record Virtualization

> And if we _don't_ ditch the dynamic toggling, then sev_es_init_vmcb() should
> be using svm_enable_lbrv(), not open coding the exact same thing.

Agreed. The patch below covers this change.
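Separately, for the lbrv removal, I'm thinking of a check along these lines in
sev_hardware_setup(). This is an untested sketch: it assumes the existing
sev_es_supported local variable and the X86_FEATURE_LBRV CPUID bit, and the
exact placement relative to the other SEV-ES feature checks is TBD:

	/*
	 * SEV-ES mandates LBR Virtualization, so disallow SEV-ES when the
	 * LBRV CPUID bit is absent. WARN because all SEV-ES capable hardware
	 * is expected to support LBRV.
	 */
	if (sev_es_supported && WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_LBRV)))
		sev_es_supported = false;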
---
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 759581bb2128..7e549ca0a4e9 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -666,6 +666,14 @@ static int __sev_launch_update_vmsa(struct kvm *kvm, struct kvm_vcpu *vcpu,
 		return ret;
 
 	vcpu->arch.guest_state_protected = true;
+
+	/*
+	 * SEV-ES guests mandate LBR Virtualization to be _always_ ON. Enable
+	 * it after setting guest_state_protected because KVM_SET_MSRS allows
+	 * dynamic toggling of LBRV (for performance reasons) on write access
+	 * to MSR_IA32_DEBUGCTLMSR when guest_state_protected is not set.
+	 */
+	svm_enable_lbrv(vcpu);
 	return 0;
 }
 
@@ -3034,7 +3042,6 @@ static void sev_es_init_vmcb(struct vcpu_svm *svm)
 	struct kvm_vcpu *vcpu = &svm->vcpu;
 
 	svm->vmcb->control.nested_ctl |= SVM_NESTED_CTL_SEV_ES_ENABLE;
-	svm->vmcb->control.virt_ext |= LBR_CTL_ENABLE_MASK;
 
 	/*
 	 * An SEV-ES guest requires a VMSA area that is a separate from the
@@ -3086,10 +3093,6 @@ static void sev_es_init_vmcb(struct vcpu_svm *svm)
 	/* Clear intercepts on selected MSRs */
 	set_msr_interception(vcpu, svm->msrpm, MSR_EFER, 1, 1);
 	set_msr_interception(vcpu, svm->msrpm, MSR_IA32_CR_PAT, 1, 1);
-	set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTBRANCHFROMIP, 1, 1);
-	set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTBRANCHTOIP, 1, 1);
-	set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTFROMIP, 1, 1);
-	set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTTOIP, 1, 1);
 }
 
 void sev_init_vmcb(struct vcpu_svm *svm)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9aaf83c8d57d..4a8bd32dfa96 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -99,6 +99,7 @@ static const struct svm_direct_access_msrs {
 	{ .index = MSR_IA32_SPEC_CTRL,			.always = false },
 	{ .index = MSR_IA32_PRED_CMD,			.always = false },
 	{ .index = MSR_IA32_FLUSH_CMD,			.always = false },
+	{ .index = MSR_IA32_DEBUGCTLMSR,		.always = false },
 	{ .index = MSR_IA32_LASTBRANCHFROMIP,		.always = false },
 	{ .index = MSR_IA32_LASTBRANCHTOIP,		.always = false },
 	{ .index = MSR_IA32_LASTINTFROMIP,		.always = false },
@@ -990,7 +991,7 @@ void svm_copy_lbrs(struct vmcb *to_vmcb, struct vmcb *from_vmcb)
 	vmcb_mark_dirty(to_vmcb, VMCB_LBR);
 }
 
-static void svm_enable_lbrv(struct kvm_vcpu *vcpu)
+void svm_enable_lbrv(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 
@@ -1000,6 +1001,9 @@ static void svm_enable_lbrv(struct kvm_vcpu *vcpu)
 	set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTFROMIP, 1, 1);
 	set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTTOIP, 1, 1);
 
+	if (sev_es_guest(vcpu->kvm))
+		set_msr_interception(vcpu, svm->msrpm, MSR_IA32_DEBUGCTLMSR, 1, 1);
+
 	/* Move the LBR msrs to the vmcb02 so that the guest can see them. */
 	if (is_guest_mode(vcpu))
 		svm_copy_lbrs(svm->vmcb, svm->vmcb01.ptr);
 }
 
@@ -1009,6 +1013,8 @@ static void svm_disable_lbrv(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 
+	KVM_BUG_ON(sev_es_guest(vcpu->kvm), vcpu->kvm);
+
 	svm->vmcb->control.virt_ext &= ~LBR_CTL_ENABLE_MASK;
 	set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTBRANCHFROMIP, 0, 0);
 	set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTBRANCHTOIP, 0, 0);
@@ -2821,10 +2827,24 @@ static int svm_get_msr_feature(struct kvm_msr_entry *msr)
 	return 0;
 }
 
+static bool
+sev_es_prevent_msr_access(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+	return sev_es_guest(vcpu->kvm) &&
+	       vcpu->arch.guest_state_protected &&
+	       svm_msrpm_offset(msr_info->index) != MSR_INVALID &&
+	       !msr_write_intercepted(vcpu, msr_info->index);
+}
+
 static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 
+	if (sev_es_prevent_msr_access(vcpu, msr_info)) {
+		msr_info->data = 0;
+		return 0;
+	}
+
 	switch (msr_info->index) {
 	case MSR_AMD64_TSC_RATIO:
 		if (!msr_info->host_initiated &&
@@ -2975,6 +2995,10 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	u32 ecx = msr->index;
 	u64 data = msr->data;
 
+	if (sev_es_prevent_msr_access(vcpu, msr))
+		return 0;
+
 	switch (ecx) {
 	case MSR_AMD64_TSC_RATIO:
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 33878efdebc8..7b2c55dd8242 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -30,7 +30,7 @@
 #define IOPM_SIZE PAGE_SIZE * 3
 #define MSRPM_SIZE PAGE_SIZE * 2
 
-#define MAX_DIRECT_ACCESS_MSRS	47
+#define MAX_DIRECT_ACCESS_MSRS	48
 #define MSRPM_OFFSETS	32
 extern u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly;
 extern bool npt_enabled;
@@ -543,6 +543,7 @@ u32 *svm_vcpu_alloc_msrpm(void);
 void svm_vcpu_init_msrpm(struct kvm_vcpu *vcpu, u32 *msrpm);
 void svm_vcpu_free_msrpm(u32 *msrpm);
 void svm_copy_lbrs(struct vmcb *to_vmcb, struct vmcb *from_vmcb);
+void svm_enable_lbrv(struct kvm_vcpu *vcpu);
 void svm_update_lbrv(struct kvm_vcpu *vcpu);
 
 int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer);
---

This is just a draft patch. As sketched above, I'll also add logic to bail out
of SEV-ES guest creation when LBRV is not supported by the host, remove the
lbrv module parameter, etc.

Thanks,
Ravi