Re: [PATCH V5 05/10] KVM: x86/pmu: Disable vPMU if the minimum num of counters isn't met

Like Xu <like.xu.linux@xxxxxxxxx> · Wed, 19 Apr 2023 17:34:55 +0800

Jim, sorry for the late reply.

On 11/4/2023 10:58 pm, Jim Mattson wrote:
On Tue, Apr 11, 2023 at 6:18 AM Like Xu <like.xu.linux@xxxxxxxxx> wrote:

On 11/4/2023 8:58 pm, Jim Mattson wrote:
On Mon, Apr 10, 2023 at 11:17 PM Like Xu <like.xu.linux@xxxxxxxxx> wrote:

On 11/4/2023 1:36 pm, Jim Mattson wrote:
On Mon, Apr 10, 2023 at 3:51 AM Like Xu <like.xu.linux@xxxxxxxxx> wrote:

From: Like Xu <likexu@xxxxxxxxxxx>

Disable PMU support when running on AMD and perf reports fewer than four
general purpose counters. All AMD PMUs must define at least four counters
due to AMD's legacy architecture hardcoding the number of counters
without providing a way to enumerate the number of counters to software,
e.g. from AMD's APM:

    The legacy architecture defines four performance counters (PerfCtrn)
    and corresponding event-select registers (PerfEvtSeln).

Virtualizing fewer than four counters can lead to guest instability as
software expects four counters to be available.

I'm confused. Isn't zero less than four?

As I understand it, you are saying that virtualization of zero counter is also
reasonable.
If so, the above statement could be refined as:

          Virtualizing fewer than four counters when vPMU is enabled may lead to guest
instability
          as software expects at least four counters to be available, thus the vPMU is
disabled if the
          minimum number of KVM supported counters is not reached during initialization.

Jim, does this help you or could you explain more about your confusion ?

You say that "fewer than four counters can lead to guest instability
as software expects four counters to be available." Your solution is
to disable the PMU, which leaves zero counters available. Zero is less
than four. Hence, by your claim, disabling the PMU can lead to guest
instability. I don't see how this is an improvement over one, two, or
three counters.

As you know, AMD pmu lacks an architected method (such as CPUID) to
indicate that the VM does not have any pmu counters available for the
current platform. Guests like Linux tend to check if their first counters
exist and work properly to infer that other pmu counters exist.

"Guests like Linux," or just Linux? What do you mean by "tend"? When
do they perform this check, and when do they not?

We do not know how guests that do not disclose their source code
will detect the presence of pmu counters.

For upstream Linux guests, such a check is implemented in the check_hw_exists(),
which checks the counters one by one, often with an error on the first counter,
and then disables pmu from the kernel perspective.

The key point is that the KVM implementation cannot rely on assumptions about
the guest kernel version, and considering that the above check was added very early,
existing Linux guest instances will most likely (tend to) check the first 
counter and
error out (a VM could also check all of the possible counters and use a bitmap with
holes to track any functional counters).


If KVM chooses to emulate greater than 1 less than 4 counters, then the
AMD guest PMU agent may assume that there are legacy 4 counters all
present (it's what the APM specifies), which requires the legacy code
to add #GP error handling for counters that should exist but actually not.

I would argue that regardless of the number of counters emulated, a
guest PMU agent may assume that the 4 legacy counters are present,
since that's what the APM specifies.

I certainly agree that, for example, a particular cpu model is stated in the spec
to have certain features (e.g. uncore pmu), but the KVM does not or chooses
not ro emulate them, for security reasons (e.g. side channel attacks), which
does violate the defined behavior of the hardware spec, such as here where
enable_pmu is false, which is not possible on almost all real hardware today.


So at Sean's suggestion, we took a conservative approach. If KVM detects
less than 4 counters, we think KVM (under the current configuration and
platform) is not capable of emulating the most basic AMD pmu capability.
A large number of legacy instances are ready for 0 or 4+ ctrs, not 2 or 3

Which specific guest operating systems is this change intended for?

Does this help you ? I wouldn't mind a better move.

Which AMD platforms have less than 4 counters available?

All this is for L2 Linux guest, as pmu on L1 Linux guest will be disabled by L0.





Suggested-by: Sean Christopherson <seanjc@xxxxxxxxxx>
Signed-off-by: Like Xu <likexu@xxxxxxxxxxx>
---
    arch/x86/kvm/pmu.h | 3 +++
    1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index dd7c7d4ffe3b..002b527360f4 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -182,6 +182,9 @@ static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
                           enable_pmu = false;
           }

+       if (!is_intel && kvm_pmu_cap.num_counters_gp < AMD64_NUM_COUNTERS)

Does this actually guarantee that the requisite number of counters are
available and will always be available while the guest is running?

Not 100%, the scheduling of physical counters depends on the host perf scheduler.

I noticed that many cloud vendors want to make sure that hardware resources
are given exclusively to VMs, but for upstream, the availability of resources
should depend entirely on the host administrators, and a VMM should take away
access to resources at any time, such as vcpu time slice.

Any attempts in the direction of exclusive use will be thwarted.

What happens if some other client of the host perf subsystem requests
a CPU-pinned counter after this checck?

Normal perf use does not grab the counters allocated for kvm, NMI-watchdog
maybe one, but it will be moved to other timer hardware like HPET.

Of interest is that some ebpf programs that access the pmu hardware directly
use the interface that perf sub-system presents to KVM in the kernel.


+               enable_pmu = false;
+
           if (!enable_pmu) {
                   memset(&kvm_pmu_cap, 0, sizeof(kvm_pmu_cap));
                   return;
--
2.40.0