Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width

On 4/5/2023 8:00 pm, Roman Kagan wrote:
> Performance counters are defined to have width less than 64 bits.  The
> vPMU code maintains the counters in u64 variables but assumes the value
> to fit within the defined width.  However, for Intel non-full-width
> counters (MSR_IA32_PERFCTRx) the value received from the guest is
> truncated to 32 bits and then sign-extended to full 64 bits.  If a
> negative value is set, it's sign-extended to 64 bits, but then in
> kvm_pmu_incr_counter() it's incremented, truncated, and compared to the
> previous value for overflow detection.

Thanks for reporting this issue. An easier-to-understand fix could be:

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index e17be25de6ca..51e75f121234 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -718,7 +718,7 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)

 static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
 {
-	pmc->prev_counter = pmc->counter;
+	pmc->prev_counter = pmc->counter & pmc_bitmask(pmc);
 	pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);
 	kvm_pmu_request_counter_reprogram(pmc);
 }

Considering that the pmu code uses pmc_bitmask(pmc) everywhere to wrap
around, I would prefer to apply the fix above first and then do a more
thorough cleanup based on your diff below. What do you think?
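
To make the failure mode concrete, here is a minimal user-space sketch (not
the kernel code itself; it only mimics kvm_pmu_incr_counter() and the
"counter < prev_counter" overflow comparison described above, and assumes a
48-bit counter width):

#include <stdint.h>
#include <stdio.h>

#define PMC_BITMASK ((1ULL << 48) - 1)	/* stand-in for pmc_bitmask(pmc) */

int main(void)
{
	/* Non-full-width MSR write: truncated to 32 bits, sign-extended to 64. */
	uint64_t counter = (uint64_t)(int64_t)(int32_t)0xffffffeaU;
	uint64_t prev_counter;

	/* Current kvm_pmu_incr_counter(): prev_counter keeps the untruncated
	 * 64-bit value while the new counter is truncated to the width. */
	prev_counter = counter;
	counter = (counter + 1) & PMC_BITMASK;

	/* The overflow check misfires although the counter never wrapped. */
	printf("spurious overflow: %d\n", counter < prev_counter);	/* 1 */

	/* With the one-line fix above, prev_counter is truncated as well. */
	prev_counter &= PMC_BITMASK;
	printf("after the fix:     %d\n", counter < prev_counter);	/* 0 */
	return 0;
}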

> That previous value is not truncated, so it always compares greater than
> the truncated new one, and a PMI is injected.  If the PMI handler writes
> a negative counter value itself, the vCPU never quits the PMI loop.
> 
> Turns out that the Linux PMI handler actually does write the counter with
> the value just read with RDPMC, so when no full-width support is exposed
> via MSR_IA32_PERF_CAPABILITIES, and the guest initializes the counter to
> a negative value, it locks up.

Not really sure what the behavioral difference is between "it locks up" and
"the vCPU never quits the PMI loop".


> We observed this in the field, for example, when the guest configures
> atop to use perfevents and runs two instances of it simultaneously.

A more fundamental case I found is this:

 kvm_msr: msr_write CTR1 = 0xffffffffffffffea
 rdpmc on host: CTR1, value 0xffffffffffe3
 kvm_exit: vcpu 0 reason EXCEPTION_NMI
 kvm_msr: msr_read CTR1 = 0x83 // nmi_handler

There are two typical issues here:
- the emulated counter value changed from 0xffffffffffffffea to 0xffffffffffe3,
 which triggers __kvm_perf_overflow(pmc, false); see the sketch below;
- the PMI handler should not reset the counter to a value that will quickly
 overflow again, in order to avoid another overflow before the iret;

Please confirm whether your usage scenarios are covered by these two cases,
or whether there are more.
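
For the first point, the arithmetic can be replayed in user space (a sketch,
again assuming the 48-bit width; the constants are the values from the trace
above):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const uint64_t mask = (1ULL << 48) - 1;		/* assumed 48-bit width */
	uint64_t prev = 0xffffffffffffffeaULL;		/* untruncated msr_write value */
	uint64_t now = 0xffffffffffe3ULL & mask;	/* truncated emulated counter */

	/* Any truncated value compares below the untruncated 64-bit one, so
	 * __kvm_perf_overflow(pmc, false) fires even though nothing wrapped. */
	printf("overflow detected: %d\n", now < prev);	/* 1 */
	return 0;
}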


> To address the problem, maintain the invariant that the counter value
> always fits in the defined bit width, by truncating the received value
> in the respective set_msr methods.  For better readability, factor this
> out into a helper function, pmc_set_counter, shared by vmx and svm
> parts.
> 
> Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
> Signed-off-by: Roman Kagan <rkagan@xxxxxxxxx>

Tested-by: Like Xu <likexu@xxxxxxxxxxx>
I prefer to use pmc_bitmask(pmc) to wrap around pmc->prev_counter as the first step.

> ---
>   arch/x86/kvm/pmu.h           | 6 ++++++
>   arch/x86/kvm/svm/pmu.c       | 2 +-
>   arch/x86/kvm/vmx/pmu_intel.c | 4 ++--
>   3 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index 5c7bbf03b599..6a91e1afef5a 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -60,6 +60,12 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
>   	return counter & pmc_bitmask(pmc);
>   }
> 
> +static inline void pmc_set_counter(struct kvm_pmc *pmc, u64 val)
> +{
> +	pmc->counter += val - pmc_read_counter(pmc);
> +	pmc->counter &= pmc_bitmask(pmc);
> +}
> +
>   static inline void pmc_release_perf_event(struct kvm_pmc *pmc)
>   {
>   	if (pmc->perf_event) {
> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> index 5fa939e411d8..f93543d84cfe 100644
> --- a/arch/x86/kvm/svm/pmu.c
> +++ b/arch/x86/kvm/svm/pmu.c
> @@ -151,7 +151,7 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>   	/* MSR_PERFCTRn */
>   	pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
>   	if (pmc) {
> -		pmc->counter += data - pmc_read_counter(pmc);
> +		pmc_set_counter(pmc, data);
>   		pmc_update_sample_period(pmc);
>   		return 0;
>   	}
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 741efe2c497b..51354e3935d4 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -467,11 +467,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>   			if (!msr_info->host_initiated &&
>   			    !(msr & MSR_PMC_FULL_WIDTH_BIT))
>   				data = (s64)(s32)data;
> -			pmc->counter += data - pmc_read_counter(pmc);
> +			pmc_set_counter(pmc, data);
>   			pmc_update_sample_period(pmc);
>   			break;
>   		} else if ((pmc = get_fixed_pmc(pmu, msr))) {
> -			pmc->counter += data - pmc_read_counter(pmc);
> +			pmc_set_counter(pmc, data);
>   			pmc_update_sample_period(pmc);
>   			break;
>   		} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
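
For what it's worth, the semantics of the new helper can be illustrated with
a stand-alone model (a sketch only: kvm_pmc is reduced to two fields, the
perf_event contribution read by pmc_read_counter() becomes a plain number,
and the 48-bit width is an assumption):

#include <stdint.h>
#include <stdio.h>

#define PMC_BITMASK ((1ULL << 48) - 1)	/* assumed 48-bit width */

struct kvm_pmc {
	uint64_t counter;		/* emulated count owned by KVM */
	uint64_t perf_event_count;	/* stand-in for perf_event_read_value() */
};

static uint64_t pmc_read_counter(struct kvm_pmc *pmc)
{
	return (pmc->counter + pmc->perf_event_count) & PMC_BITMASK;
}

static void pmc_set_counter(struct kvm_pmc *pmc, uint64_t val)
{
	/* Adjust by the delta so the perf_event contribution stays accounted
	 * for, then truncate to the architectural width (the new invariant). */
	pmc->counter += val - pmc_read_counter(pmc);
	pmc->counter &= PMC_BITMASK;
}

int main(void)
{
	struct kvm_pmc pmc = { .counter = 0x100, .perf_event_count = 0x23 };

	/* Guest writes a sign-extended negative value. */
	pmc_set_counter(&pmc, 0xffffffffffffffeaULL);

	/* pmc.counter now fits in 48 bits, and a read still returns the
	 * written value truncated to the width: 0xffffffffffea. */
	printf("counter=0x%llx read=0x%llx\n",
	       (unsigned long long)pmc.counter,
	       (unsigned long long)pmc_read_counter(&pmc));
	return 0;
}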


