On Thu, Feb 10, 2022 at 10:30 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>
> On 2/10/2022 11:34 AM, Jim Mattson wrote:
> > On Thu, Feb 10, 2022 at 7:34 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
> >>
> >> On 2/9/2022 2:24 PM, David Dunn wrote:
> >>> Dave,
> >>>
> >>> In my opinion, the right policy depends on what the host owner and
> >>> guest owner are trying to achieve.
> >>>
> >>> If the PMU is being used to locate places where performance could be
> >>> improved in the system, there are two sub scenarios:
> >>> - The host and guest are owned by the same entity that is optimizing
> >>> the overall system. In this case, the guest doesn't need PMU access and
> >>> better information is provided by profiling the entire system from the
> >>> host.
> >>> - The host and guest are owned by different entities. In this
> >>> case, profiling from the host can identify perf issues in the guest.
> >>> But what action can be taken? The host entity must communicate issues
> >>> back to the guest owner through some sort of out-of-band information
> >>> channel. On the other hand, preempting the host PMU to give the guest
> >>> a fully functional PMU serves this use case well.
> >>>
> >>> TDX and SGX (outside of debug mode) strongly assume different
> >>> entities. And Intel is doing this to reduce insight of the host into
> >>> guest operations. So in my opinion, preemption makes sense.
> >>>
> >>> There are also scenarios where the host owner is trying to identify
> >>> systemwide impacts of guest actions. For example, detecting memory
> >>> bandwidth consumption or split locks. In this case, host control
> >>> without preemption is necessary.
> >>>
> >>> To address these various scenarios, it seems like the host needs to be
> >>> able to have policy control on whether it is willing to have the PMU
> >>> preempted by the guest.
> >>>
> >>> But I don't see what scenario is well served by the current situation
> >>> in KVM.
> >>> Currently the guest will either be told it has no PMU (which
> >>> is fine) or that it has full control of a PMU. If the guest is told
> >>> it has full control of the PMU, it actually doesn't. But instead of
> >>> losing counters on well-defined events (from the guest perspective),
> >>> they simply stop counting depending on what the host is doing with the
> >>> PMU.
> >>
> >> For the current perf subsystem, a PMU should be shared among different
> >> users via the multiplexing mechanism if the resource is limited. No one
> >> has full control of a PMU for its lifetime. A user can only have the PMU in
> >> its given period. I think the user can understand how long it runs via
> >> total_time_enabled and total_time_running.
> >
> > For most clients, yes. For KVM, no. KVM currently tosses
> > total_time_enabled and total_time_running in the bitbucket. It could
> > extrapolate, but that would result in loss of precision. Some guest
> > uses of the PMU would not be able to cope (e.g.
> > https://github.com/rr-debugger/rr).
> >
> >> For a guest, it should rely on the host to tell whether the PMU resource
> >> is available. But unfortunately, I don't think we have such a
> >> notification mechanism in KVM. The guest has the wrong impression that
> >> the guest can have full control of the PMU.
> >
> > That is the only impression that the architectural specification
> > allows the guest to have. On Intel, we can mask off individual fixed
> > counters, and we can reduce the number of GP counters, but AMD offers
> > us no such freedom. Whatever resources we advertise to the guest must
> > be available for its use whenever it wants. Otherwise, PMU
> > virtualization is simply broken.
> >
> >> In my opinion, we should add the notification mechanism in KVM. When the
> >> PMU resource is limited, the guest can know whether it's multiplexing or
> >> can choose to reschedule the event.
> >
> > That sounds like a paravirtual perf mechanism, rather than PMU
> > virtualization.
> > Are you suggesting that we not try to virtualize the
> > PMU? Unfortunately, PMU virtualization is what we have customers
> > clamoring for. No one is interested in a paravirtual perf mechanism.
> > For example, when will VTune in the guest know how to use your
> > proposed paravirtual interface?
>
> OK. If KVM cannot notify the guest, maybe the guest can query the usage of
> counters before using a counter. There is an IA32_PERF_GLOBAL_INUSE MSR
> introduced with Arch perfmon v4. The MSR provides an "InUse" bit for
> each counter. But it cannot guarantee that the counter can always be
> owned by the guest unless the host treats the guest as a super-user and
> agrees to not touch its counter. This should only work on Intel
> platforms.

Simple question: Do all existing guests (Windows and Linux are my
primary interest) query that MSR today? If not, then this proposal is
DOA.

> >
> >> But it seems the notification mechanism may not work for the TDX case?
> >>>
> >>> On the other hand, if we flip it around the semantics are more clear.
> >>> A guest will be told it has no PMU (which is fine) or that it has full
> >>> control of the PMU. If the guest is told that it has full control of
> >>> the PMU, it does. And the host (which is the thing that granted the
> >>> full PMU to the guest) knows that events inside the guest are not
> >>> being measured. This results in all entities seeing something that
> >>> can be reasoned about from their perspective.
> >>>
> >>
> >> I assume that this is for the TDX case (where the notification mechanism
> >> doesn't work). The host still controls all the PMU resources. The TDX
> >> guest is treated as a super-user who can 'own' a PMU. The admin in the
> >> host can configure/change the owned PMUs of the TDX. Personally, I think
> >> it makes sense. But please keep in mind that the counters are not
> >> identical. There are some special events that can only run on a specific
> >> counter.
> >> If the special counter is assigned to TDX, other entities can
> >> never run some events. We should let other entities know if it happens.
> >> Or we should never let non-host entities own the special counter.
> >
> > Right; the counters are not fungible. Ideally, when the guest requests
> > a particular counter, that is the counter it gets. If it is given a
> > different counter, the counter it is given must provide the same
> > behavior as the requested counter for the event in question.
>
> Ideally, yes, but sometimes KVM/host may not know whether they can use
> another counter to replace the requested counter, because KVM/host
> cannot retrieve the event constraint information from the guest.

In that case, don't do it. When the guest asks for a specific counter,
give the guest that counter. This isn't rocket science.

> For example, we have the Precise Distribution (PDist) feature enabled only
> for the GP counter 0 on SPR. Perf uses the precise_level 3 (a SW
> variable) to indicate the feature. The KVM/host never knows
> whether the guest applies the PDist feature.
>
> I have a patch that forces the perf scheduler to start from the regular
> counters, which may mitigate the issue, but cannot fix it. (I will post
> the patch separately.)
>
> Or we should never let the guest own the special counters. Although the
> guest has to lose some special events, I guess the host may be more
> willing to let the guest own a regular counter.
>
>
> Thanks,
> Kan
>
> >
> >>
> >> Thanks,
> >> Kan
> >>
> >>> Thanks,
> >>>
> >>> Dave Dunn
> >>>
> >>> On Wed, Feb 9, 2022 at 10:57 AM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
> >>>
> >>>>> I was referring to gaps in the collection of data that the host perf
> >>>>> subsystem doesn't know about if ATTRIBUTES.PERFMON is set for a TDX
> >>>>> guest. This can potentially be a problem if someone is trying to
> >>>>> measure events per unit of time.
> >>>>
> >>>> Ahh, that makes sense.
> >>>>
> >>>> Does SGX cause problems for these people? It can create some of the same
> >>>> collection gaps:
> >>>>
> >>>>     performance monitoring activities are suppressed when entering
> >>>>     an opt-out (of performance monitoring) enclave.
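
For reference, on the total_time_enabled / total_time_running point raised
earlier in the thread: the usual way a perf user extrapolates a multiplexed
count is to scale the raw value by enabled time over running time. Below is a
minimal sketch of that arithmetic only. The function name scale_count is
hypothetical and is not taken from the kernel or from KVM. The result is an
estimate that assumes a uniform event rate, which is why a consumer that needs
exact counts (such as rr) cannot rely on it.

```python
from fractions import Fraction
from typing import Optional


def scale_count(raw_count: int, time_enabled: int, time_running: int) -> Optional[Fraction]:
    """Extrapolate a multiplexed counter value to the whole enabled window.

    raw_count    -- events counted while the event was actually on hardware
    time_enabled -- total time the event was enabled (total_time_enabled)
    time_running -- total time the event was scheduled on a counter
                    (total_time_running)

    Hypothetical helper for illustration; returns None when the event never
    ran, because there is nothing to extrapolate from.
    """
    if time_running == 0:
        return None
    # Exact rational arithmetic so the sketch itself adds no rounding error;
    # the estimate is still only as good as the uniform-rate assumption.
    return Fraction(raw_count * time_enabled, time_running)


# An event that was on hardware for a quarter of its enabled time:
estimate = scale_count(1000, 4_000_000, 1_000_000)
print(estimate)  # 4000
```

When time_running equals time_enabled the event was never multiplexed and the
scaled value is simply the raw count.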