Re: [PATCH kvm/queue v2 2/3] perf: x86/core: Add interface to query perfmon_event_map[] directly

Jim Mattson <jmattson@xxxxxxxxxx> · Thu, 10 Feb 2022 11:16:52 -0800

On Thu, Feb 10, 2022 at 10:30 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>
>
>
> On 2/10/2022 11:34 AM, Jim Mattson wrote:
> > On Thu, Feb 10, 2022 at 7:34 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 2/9/2022 2:24 PM, David Dunn wrote:
> >>> Dave,
> >>>
> >>> In my opinion, the right policy depends on what the host owner and
> >>> guest owner are trying to achieve.
> >>>
> >>> If the PMU is being used to locate places where performance could be
> >>> improved in the system, there are two sub-scenarios:
> >>>      - The host and guest are owned by the same entity, which is optimizing
> >>> the overall system.  In this case, the guest doesn't need PMU access and
> >>> better information is provided by profiling the entire system from the
> >>> host.
> >>>      - The host and guest are owned by different entities.  In this
> >>> case, profiling from the host can identify perf issues in the guest.
> >>> But what action can be taken?  The host entity must communicate issues
> >>> back to the guest owner through some sort of out-of-band information
> >>> channel.  On the other hand, preempting the host PMU to give the guest
> >>> a fully functional PMU serves this use case well.
> >>>
> >>> TDX and SGX (outside of debug mode) strongly assume different
> >>> entities.  And Intel is doing this to reduce insight of the host into
> >>> guest operations.  So in my opinion, preemption makes sense.
> >>>
> >>> There are also scenarios where the host owner is trying to identify
> >>> systemwide impacts of guest actions.  For example, detecting memory
> >>> bandwidth consumption or split locks.  In this case, host control
> >>> without preemption is necessary.
> >>>
> >>> To address these various scenarios, it seems like the host needs to be
> >>> able to have policy control on whether it is willing to have the PMU
> >>> preempted by the guest.
> >>>
> >>> But I don't see what scenario is well served by the current situation
> >>> in KVM.  Currently the guest will either be told it has no PMU (which
> >>> is fine) or that it has full control of a PMU.  If the guest is told
> >>> it has full control of the PMU, it actually doesn't.  But instead of
> >>> losing counters on well defined events (from the guest perspective),
> >>> they simply stop counting depending on what the host is doing with the
> >>> PMU.
> >>
> >> For the current perf subsystem, a PMU should be shared among different
> >> users via the multiplexing mechanism if the resource is limited. No one
> >> has full control of a PMU for its lifetime. A user can only have the PMU in
> >> its given period. I think the user can understand how long it runs via
> >> total_time_enabled and total_time_running.
> >
> > For most clients, yes. For kvm, no. KVM currently tosses
> > total_time_enabled and total_time_running in the bitbucket. It could
> > extrapolate, but that would result in loss of precision. Some guest
> > uses of the PMU would not be able to cope (e.g.
> > https://github.com/rr-debugger/rr).
> >
> >> For a guest, it should rely on the host to tell whether the PMU resource
> >> is available. But unfortunately, I don't think we have such a
> >> notification mechanism in KVM. The guest has the wrong impression that
> >> the guest can have full control of the PMU.
> >
> > That is the only impression that the architectural specification
> > allows the guest to have. On Intel, we can mask off individual fixed
> > counters, and we can reduce the number of GP counters, but AMD offers
> > us no such freedom. Whatever resources we advertise to the guest must
> > be available for its use whenever it wants. Otherwise, PMU
> > virtualization is simply broken.
> >
> >> In my opinion, we should add the notification mechanism in KVM. When the
> >> PMU resource is limited, the guest can know whether it's multiplexing or
> >> can choose to reschedule the event.
> >
> > That sounds like a paravirtual perf mechanism, rather than PMU
> > virtualization. Are you suggesting that we not try to virtualize the
> > PMU? Unfortunately, PMU virtualization is what we have customers
> > clamoring for. No one is interested in a paravirtual perf mechanism.
> > For example, when will VTune in the guest know how to use your
> > proposed paravirtual interface?
>
> OK. If KVM cannot notify the guest, maybe the guest can query the usage of
> counters before using a counter. There is an IA32_PERF_GLOBAL_INUSE MSR
> introduced with Arch perfmon v4. The MSR provides an "InUse" bit for
> each counter. But it cannot guarantee that the counter can always be
> owned by the guest unless the host treats the guest as a super-user and
> agrees to not touch its counter. This should only work for the Intel
> platforms.

Simple question: Do all existing guests (Windows and Linux are my
primary interest) query that MSR today? If not, then this proposal is
DOA.
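
As a rough illustration of what such a query would involve on the guest
side, here is a sketch that decodes a raw IA32_PERF_GLOBAL_INUSE value. The
bit layout assumed here (bit n for IA32_PERFEVTSELn, bit 32+n for fixed
counter n, bit 63 for PMI) should be checked against the SDM; the function
name, parameters and sample value are hypothetical, and the value is passed
in rather than read from hardware.

```python
# Assumed bit positions; verify against the SDM before relying on them.
FIXED_BASE_BIT = 32   # bit 32+n: fixed counter n in use (assumption)
PMI_IN_USE_BIT = 63   # bit 63: PMI in use (assumption)


def decode_global_inuse(value, num_gp, num_fixed):
    """Split a raw 64-bit register value into per-counter in-use flags.

    num_gp and num_fixed are the counter counts the caller believes the
    platform exposes (normally taken from CPUID; hypothetical inputs here).
    """
    gp = [bool((value >> i) & 1) for i in range(num_gp)]
    fixed = [bool((value >> (FIXED_BASE_BIT + i)) & 1) for i in range(num_fixed)]
    pmi = bool((value >> PMI_IN_USE_BIT) & 1)
    return {"gp": gp, "fixed": fixed, "pmi": pmi}


# Made-up sample value: GP counter 0, fixed counter 1 and PMI marked in use.
sample = (1 << 0) | (1 << 33) | (1 << 63)
flags = decode_global_inuse(sample, num_gp=4, num_fixed=3)
```

A clear bit only reports the state at the time of the read. As the quoted
text says, it does not guarantee that the counter stays available to the
guest afterwards.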

> >
> >> But it seems the notification mechanism may not work for the TDX case?
> >>>
> >>> On the other hand, if we flip it around, the semantics are clearer.
> >>> A guest will be told it has no PMU (which is fine) or that it has full
> >>> control of the PMU.  If the guest is told that it has full control of
> >>> the PMU, it does.  And the host (which is the thing that granted the
> >>> full PMU to the guest) knows that events inside the guest are not
> >>> being measured.  This results in all entities seeing something that
> >>> can be reasoned about from their perspective.
> >>>
> >>
> >> I assume that this is for the TDX case (where the notification mechanism
> >> doesn't work). The host still controls all the PMU resources. The TDX
> >> guest is treated as a super-user who can 'own' a PMU. The admin in the
> >> host can configure/change the owned PMUs of the TDX guest. Personally, I think
> >> it makes sense. But please keep in mind that the counters are not
> >> identical. There are some special events that can only run on a specific
> >> counter. If the special counter is assigned to TDX, other entities can
> >> never run some events. We should let other entities know if it happens.
> >> Or we should never let non-host entities own the special counter.
> >
> > Right; the counters are not fungible. Ideally, when the guest requests
> > a particular counter, that is the counter it gets. If it is given a
> > different counter, the counter it is given must provide the same
> > behavior as the requested counter for the event in question.
>
> Ideally, yes, but sometimes KVM/host may not know whether they can use
> another counter to replace the requested counter, because KVM/host
> cannot retrieve the event constraint information from the guest.

In that case, don't do it. When the guest asks for a specific counter,
give the guest that counter. This isn't rocket science.
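
A toy model of that policy, assuming a hypothetical host-side set of
counter indices that are free to hand out (none of these names come from
KVM or perf): the request is either honored with the exact counter asked
for, or refused, and is never silently redirected to a different counter.

```python
def grant_counter(requested, free):
    """Grant exactly the requested counter index, or refuse.

    free is a set of counter indices the host is willing to hand out
    (hypothetical bookkeeping); it is updated in place on success.
    Returns the granted index, or None if the request is refused.
    """
    if requested in free:
        free.remove(requested)
        return requested
    # No substitution: the host cannot see the guest's event constraints.
    return None


free_counters = {1, 2}
refused = grant_counter(0, free_counters)   # counter 0 is not available
granted = grant_counter(2, free_counters)   # counter 2 is handed out as asked
```

Here the request for counter 0 is refused rather than mapped to counter 1,
since the host cannot tell whether the guest's event depends on a feature
that exists only on counter 0.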

> For example, we have the Precise Distribution (PDist) feature enabled only
> for the GP counter 0 on SPR. Perf uses the precise_level 3 (a SW
> variable) to indicate the feature. KVM/host never knows
> whether the guest applies the PDist feature.
>
> I have a patch that forces the perf scheduler to start from the regular
> counters, which may mitigate the issue, but cannot fix it. (I will post
> the patch separately.)
>
> Or we should never let the guest own the special counters. Although the
> guest has to lose some special events, I guess the host may be more
> willing to let the guest own a regular counter.
>
>
> Thanks,
> Kan
>
> >
> >>
> >> Thanks,
> >> Kan
> >>
> >>> Thanks,
> >>>
> >>> Dave Dunn
> >>>
> >>> On Wed, Feb 9, 2022 at 10:57 AM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
> >>>
> >>>>> I was referring to gaps in the collection of data that the host perf
> >>>>> subsystem doesn't know about if ATTRIBUTES.PERFMON is set for a TDX
> >>>>> guest. This can potentially be a problem if someone is trying to
> >>>>> measure events per unit of time.
> >>>>
> >>>> Ahh, that makes sense.
> >>>>
> >>>> Does SGX cause problems for these people?  It can create some of the same
> >>>> collection gaps:
> >>>>
> >>>>           performance monitoring activities are suppressed when entering
> >>>>           an opt-out (of performance monitoring) enclave.