Re: [PATCH] KVM: x86: vPMU: truncate counter value to allowed width

On Tue, May 23, 2023 at 08:40:53PM +0800, Like Xu wrote:
> On 4/5/2023 8:00 pm, Roman Kagan wrote:
> > Performance counters are defined to have width less than 64 bits.  The
> > vPMU code maintains the counters in u64 variables but assumes the value
> > to fit within the defined width.  However, for Intel non-full-width
> counters (MSR_IA32_PERFCTRx) the value received from the guest is
> > truncated to 32 bits and then sign-extended to full 64 bits.  If a
> > negative value is set, it's sign-extended to 64 bits, but then in
> > kvm_pmu_incr_counter() it's incremented, truncated, and compared to the
> > previous value for overflow detection.
> 
> Thanks for reporting this issue. An easier-to-understand fix could be:
> 
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index e17be25de6ca..51e75f121234 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -718,7 +718,7 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
> 
>  static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
>  {
> -       pmc->prev_counter = pmc->counter;
> +       pmc->prev_counter = pmc->counter & pmc_bitmask(pmc);
>        pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);
>        kvm_pmu_request_counter_reprogram(pmc);
>  }
> 
> Considering that the pmu code uses pmc_bitmask(pmc) everywhere to wrap
> around, I would prefer to use this fix above first and then do a more thorough
> cleanup based on your below diff. What do you think ?

I did exactly this at first.  However, maintaining the invariant that
pmc->counter always fits in the assumed width felt more natural, easier
to reason about, and less error-prone going forward.
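
To make the failure mode concrete, here is a minimal standalone sketch
(plain C, not the actual KVM code; the 48-bit counter width and the
written value are assumptions for illustration, matching the trace
quoted below):

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          /* stand-in for pmc_bitmask() with a 48-bit counter */
          const uint64_t mask = (1ULL << 48) - 1;

          /* non-full-width write: low 32 bits sign-extended to 64 bits */
          uint64_t counter = (uint64_t)(int64_t)(int32_t)0xffffffea;
          uint64_t prev = counter;        /* prev_counter: NOT truncated */

          /* kvm_pmu_incr_counter(): increment, then truncate */
          counter = (counter + 1) & mask;

          /* 0xffffffffffeb < 0xffffffffffffffea: spurious "overflow" */
          printf("overflow on first increment: %s\n",
                 counter < prev ? "yes" : "no");
          return 0;
  }

With the invariant in place, the initial value is already truncated to
0xffffffffffea, the comparison sees 0xffffffffffeb < 0xffffffffffea as
false, and no spurious PMI is injected.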

> > That previous value is not truncated, so it always compares greater than
> > the truncated new one, and a PMI is injected.  If the PMI handler writes
> > a negative counter value itself, the vCPU never quits the PMI loop.
> > 
> > Turns out that the Linux PMI handler actually does write the counter with
> > the value just read with RDPMC, so when no full-width support is exposed
> > via MSR_IA32_PERF_CAPABILITIES, and the guest initializes the counter to
> > a negative value, it locks up.
> 
> Not really sure what the behavioral difference is between "it locks up" and
> "the vCPU never quits the PMI loop".

No difference; the second paragraph just illustrates the specific case
of the Linux PMI handler on a system that doesn't expose full-width
counter support, so the problematic code path is triggered.

> > We observed this in the field, for example, when the guest configures
> > atop to use perfevents and runs two instances of it simultaneously.
> 
> A more fundamental case I found is this:
> 
>  kvm_msr: msr_write CTR1 = 0xffffffffffffffea
>  rdpmc on host: CTR1, value 0xffffffffffe3
>  kvm_exit: vcpu 0 reason EXCEPTION_NMI
>  kvm_msr: msr_read CTR1 = 0x83 // nmi_handler
> 
> There are two typical issues here:
> - the emulated counter value changed from 0xffffffffffffffea to 0xffffffffffe3,

Strange, why would the counter go backwards (disregarding the extra high
bits)?

>  triggering __kvm_perf_overflow(pmc, false);
> - the PMI handler should not reset the counter to a value that easily overflows,
>  to avoid overflowing again here before the iret;

This is a legitimate guest behavior, isn't it?  The problem I'm trying
to fix is when the value written is sufficiently far from overflowing,
but it still ends up looking like an overflow and triggers a PMI.

> Please confirm whether your usage scenarios consist of these two types, or more.

The *usage* scenario is the one I wrote above.  I've identified an
easier repro since then:

  perf stat -e instructions -C 0 &
  sleep 1 && perf stat -e instructions,cycles -C 0 sleep 0.1 && kill -INT %

Please also see the kvm-unit-test I posted at
https://lore.kernel.org/kvm/20230504134908.830041-1-rkagan@xxxxxxxxx

> > To address the problem, maintain the invariant that the counter value
> > always fits in the defined bit width, by truncating the received value
> > in the respective set_msr methods.  For better readability, factor this
> > out into a helper function, pmc_set_counter, shared by vmx and svm
> > parts.
> > 
> > Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
> > Signed-off-by: Roman Kagan <rkagan@xxxxxxxxx>
> 
> Tested-by: Like Xu <likexu@xxxxxxxxxxx>
> I prefer to use pmc_bitmask(pmc) to wrap around pmc->prev_counter as the first step.

My preference is to keep the counter within the limits all the time, so
I'd leave it up to the maintainers to decide.
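
For reference, the helper described in the commit message amounts to a
masked write; a minimal sketch (assuming the existing pmc_bitmask()
helper, and simplified relative to the actual patch):

  /* Truncate a written value to the counter's architectural width */
  static inline void pmc_set_counter(struct kvm_pmc *pmc, u64 val)
  {
          pmc->counter = val & pmc_bitmask(pmc);
  }

The set_msr paths in vmx/pmu_intel.c and svm/pmu.c then call this
instead of assigning pmc->counter directly, so a stored counter value
can never exceed the allowed width.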

Thanks,
Roman.








