Re: [PATCH kvm/queue v2 2/3] perf: x86/core: Add interface to query perfmon_event_map[] directly

Like Xu <like.xu.linux@xxxxxxxxx> · Wed, 16 Feb 2022 15:36:42 +0800

On 15/2/2022 6:55 am, Jim Mattson wrote:
On Mon, Feb 14, 2022 at 1:55 PM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:

On 2/12/2022 6:31 PM, Jim Mattson wrote:
On Fri, Feb 11, 2022 at 1:47 PM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:

On 2/11/2022 1:08 PM, Jim Mattson wrote:
On Fri, Feb 11, 2022 at 6:11 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:

On 2/10/2022 2:55 PM, David Dunn wrote:
Kan,

On Thu, Feb 10, 2022 at 11:46 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:

No, we don't, at least for Linux, because the host owns everything. It
doesn't need the MSR to tell which one is in use. We track it in software.

For the new request from the guest to own a counter, I guess maybe it is
worth implementing. But yes, existing/legacy guests never check
the MSR.

This is the expectation of all software that uses the PMU in every
guest.  It isn't just the Linux perf system.

Hardware resources will always be limited (the good news is that
the number of counters will increase in the future), and when we have
a common need for the same hardware, we should prioritize overall
fairness over a GUEST-first strategy.

The KVM vPMU model we have today results in PMU-utilizing software
simply not working properly in a guest.  The only case that can
consistently "work" today is not giving the guest a PMU at all.

Actually, I'd be more interested in the exact usage that is "not working properly".

And that's why you are hearing requests to gift the entire PMU to the
guest while it is running. All existing PMU software knows about the
various constraints on exactly how each MSR must be used to get sane
data.  And by gifting the entire PMU it allows that software to work
properly.  But that has to be controlled by policy at host level such
that the owner of the host knows that they are not going to have PMU
visibility into guests that have control of PMU.

Don't forget that the perf subsystem manages the uncore PMU and
other profiling resources as well. Making some resources visible to
host perf and others invisible would require a big effort, and it
would be very hard to maintain as PMU hardware resources
become more and more numerous.

IMO, there is no future in special-casing host perf for KVM.

I think here is how a guest event works today with KVM and the perf subsystem:
        - The guest creates an event A.
        - The guest kernel assigns a guest counter M to event A and configures
          the related MSRs of guest counter M.
        - KVM intercepts the MSR access and creates a host event B. (Host
          event B is based on the settings of guest counter M. As I said,
          at least for Linux, some SW config impacts the counter assignment,
          and KVM never knows about it. Event B can only be a similar event
          to A.)
        - The Linux perf subsystem assigns a physical counter N to host event
          B according to event B's constraints. (N may not be the same as M,
          because A and B may have different event constraints.)
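To make the M-versus-N point concrete, here is a minimal sketch of constraint-based counter assignment. It is not the actual perf scheduling code (which is considerably more involved); the names sketch_event, sketch_assign_counter and SKETCH_NR_COUNTERS are made up for illustration, and a "constraint" is assumed to be a plain bitmask of permitted counters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NR_COUNTERS 4	/* hypothetical number of GP counters */

/* Hypothetical event: a bitmask of the counters it is allowed to use. */
struct sketch_event {
	uint32_t allowed_mask;
};

/*
 * Pick the lowest-numbered free counter permitted by the event's
 * constraint mask. Returns the counter index, or -1 if none is free.
 * On success, the chosen counter is marked busy in *in_use.
 */
static int sketch_assign_counter(const struct sketch_event *ev,
				 uint32_t *in_use)
{
	uint32_t candidates;
	int i;

	assert(ev != NULL && in_use != NULL);

	candidates = ev->allowed_mask & ~*in_use &
		     ((1u << SKETCH_NR_COUNTERS) - 1);

	for (i = 0; i < SKETCH_NR_COUNTERS; i++) {
		if (candidates & (1u << i)) {
			*in_use |= 1u << i;
			return i;
		}
	}
	return -1;
}
```

With this sketch, if the guest programmed its counter 0 (M = 0) but physical counter 0 is already busy on the host, a host event B that may run on any counter lands on physical counter 1 (N = 1), while an event restricted to counter 0 cannot be scheduled at all.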

The priority and availability left over from earlier scheduling decisions
also matter. This cross-mapping is relatively common and normal in the real world.

As you can see, even if the entire PMU is given to the guest, we still
cannot guarantee that the physical counter M can be assigned to the
guest event A.

All we know about the guest is that it has programmed virtual counter
M. It seems obvious to me that we can satisfy that request by giving
it physical counter M. If, for whatever reason, we give it physical
counter N instead, and M and N are not completely fungible, then we
have failed.

Only host perf has the global perspective to derive interchangeability,
which is a reasonable and flexible design.

How to fix it? The only thing I can imagine is "passthrough": let KVM
directly assign counter M to the guest. So, to me, this policy sounds
like letting KVM replace perf in controlling all PMU resources, and then
handing them over to the guest. Is that what we want?

I would prefer to see more code on this idea, as I am very unsure
of my code practice based on the current hardware interface, and
the TDX PMU enablement work might be a good starting point.

We want PMU virtualization to work. There are at least two ways of doing that:
1) Cede the entire PMU to the guest while it's running.

So the guest will take over control of the entire PMU while it's
running. I know some people want to do system-wide monitoring. That case
will fail.

I think from the vm-entry/exit level of granularity, we can do it
even if someone wants to do system-wide monitoring, and the
hard part is the sharing, synchronization and filtering of perf data.

We have system-wide monitoring for fleet efficiency, but since there's
nothing we can do about the efficiency of the guest (and those cycles
are paid for by the customer, anyway), I don't think our efficiency
experts lose any important insights if guest cycles are a blind spot.

Visibility of (non-tdx/sev) guest performance data helps with globally optimal
solutions, such as memory migration and malicious attack detection, and
there are too many other PMU use cases that ignore the vm-entry boundary.

Others, e.g., NMI watchdog, also occupy a performance counter. I think
the NMI watchdog is enabled by default at least for the current Linux
kernel. You have to disable all such cases in the host when the guest is
running.

It doesn't actually make any sense to run the NMI watchdog while in
the guest, does it?

I wouldn't think so; AFAIK, a lot of guest vendor images run the NMI watchdog.

I'm not sure whether you can fully trust the guest. If malware runs in
the guest, I don't know whether it will harm the entire system. I'm not
a security expert, but it sounds dangerous.
Hope the administrators know what they are doing when choosing this policy.

We always consider the security attack surface when choosing which
PMU features can be virtualized. If someone then reports a problem,
we quietly and urgently fix it. We already have the module parameter
for vPMU, and per-VM parameters are in the works.

Virtual machines are inherently dangerous. :-)

Without a specific context, any hardware or software stack is inherently dangerous.

Despite our concerns about PMU side-channels, Intel is constantly
reminding us that no such attacks are yet known. We would probably
restrict some events to guests that occupy an entire socket, just to
be safe.

The socket level performance data should not be available for current guests.

Note that on the flip side, TDX and SEV are all about catering to
guests that don't trust the host. Those customers probably don't want
the host to be able to use the PMU to snoop on guest activity.

Of course, for non-tdx/sev guests, we (upstream) should not prevent the PMU
from being used to snoop on guest activity. How it is used is a matter for
the administrators, but usability or feasibility is a technical consideration.

2) Introduce a new "ultimate" priority level in the host perf
subsystem. Only KVM can request events at the ultimate priority, and
these requests supersede any other requests.

The "ultimate" priority level doesn't help in the above case. Counter M
may not provide the precision that the guest requests. I remember you
called it "broken".

Ideally, ultimate priority also comes with the ability to request
specific counters.
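As a rough illustration of what "ultimate priority plus a specific counter" could mean, here is a small sketch. Nothing like this exists in perf today; the names sketch_prio, sketch_pmu and sketch_request_counter are hypothetical, and ownership is reduced to one priority value per counter.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NUM_PMCS 4	/* hypothetical number of counters */

enum sketch_prio {
	SKETCH_PRIO_NONE = 0,	/* counter is free */
	SKETCH_PRIO_NORMAL,	/* ordinary host perf user */
	SKETCH_PRIO_ULTIMATE,	/* assumed to be reserved for KVM */
};

struct sketch_pmu {
	enum sketch_prio owner_prio[SKETCH_NUM_PMCS];
};

/*
 * Request one specific counter at a given priority.
 * Returns 1 if the counter is granted (evicting a strictly
 * lower-priority owner if there is one), 0 otherwise.
 */
static int sketch_request_counter(struct sketch_pmu *pmu, int idx,
				  enum sketch_prio prio)
{
	assert(pmu != NULL && idx >= 0 && idx < SKETCH_NUM_PMCS);
	assert(prio != SKETCH_PRIO_NONE);

	if (pmu->owner_prio[idx] >= prio)
		return 0;	/* owner has equal or higher priority */

	pmu->owner_prio[idx] = prio;
	return 1;
}
```

In this sketch an ultimate-priority request for counter 2 succeeds even when a normal-priority user already holds it, but a second ultimate-priority request for the same counter is refused. How the evicted user would be told is exactly the open question in this thread, and the sketch does not address it.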

We need to relay the vcpu's specific needs for counters of different
capabilities to the host perf so that it can make the right decisions.

KVM can fail the request, but KVM cannot notify the guest. The guest still
sees a wrong result.

Indeed, the pain point is "KVM cannot notify the guest.", or we need to
remove all the "reasons for notification" one by one.

Other solutions are welcome.

I don't have a perfect solution to achieve all your requirements. Based
on my understanding, the guest has to be compromised by either
tolerating some errors or dropping some features (e.g., some special
events). With that, we may consider the above "ultimate" priority level
policy. The default policy would be the same as the current
implementation, where the host perf treats all the users, including the
guest, equally. The administrators can set the "ultimate" priority level
policy, which may let the KVM/guest pin/own some regular counters via
perf subsystem. That's just my personal opinion for your reference.

I disagree. The guest does not have to be compromised. For a proof of
concept, see VMware ESXi. Probably Microsoft Hyper-V as well, though I
haven't checked.

As far as I know, VMware ESXi has its own VMkernel, which can own the
entire HW PMU.  KVM is part of the Linux kernel. The HW PMUs should
be shared among components/users. I think the cases are different.

Architecturally, ESXi is not very different from Linux. The VMkernel
is a posix-compliant kernel, and VMware's "vmm" is comparable to kvm.

Also, from what I searched on the VMware website, they still encounter
the case that a guest VM may not get a performance monitoring counter.
It looks like their solution is to let the guest OS check the availability
of the counter, which is similar to the solution I mentioned (Use
GLOBAL_INUSE MSR).

"If an ESXi host's BIOS uses a performance counter or if Fault Tolerance
is enabled, some virtual performance counters might not be available for
the virtual machine to use."

I'm perfectly happy to give up PMCs on Linux under those same conditions.

https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.vm_admin.doc/GUID-F920A3C7-3B42-4E78-8EA7-961E49AF479D.html

"In general, if a physical CPU PMC is in use, the corresponding virtual
CPU PMC is not functional and is unavailable for use by the guest. Guest
OS software detects unavailable general purpose PMCs by checking for a
non-zero event select MSR value when a virtual machine powers on."

https://kb.vmware.com/s/article/2030221

Linux, at least, doesn't do that. Maybe Windows does.
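For reference, the detection rule quoted from the VMware KB could look roughly like the sketch below on the guest side. This is only an illustration of the quoted rule, not code from any guest OS: the function name sketch_available_pmcs is made up, and the evtsel[] array stands in for the values a real guest would read with rdmsr from its event select MSRs at power-on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * evtsel[i] is the (hypothetical) power-on value of the i-th event
 * select MSR. Per the quoted rule, a non-zero value means the
 * corresponding counter is unavailable to the guest.
 * Returns a bitmask with bit i set when counter i looks available.
 */
static uint32_t sketch_available_pmcs(const uint64_t *evtsel, size_t n)
{
	uint32_t avail = 0;
	size_t i;

	assert(evtsel != NULL && n <= 32);

	for (i = 0; i < n; i++) {
		if (evtsel[i] == 0)
			avail |= (uint32_t)1 << i;
	}
	return avail;
}
```

Given four event select values where only the second one is non-zero, the sketch reports counters 0, 2 and 3 as usable. A guest that skips this check, as Linux does today, would simply program counter 1 and get no sane data from it.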

Thanks,
Kan