On 2024-04-26 2:41 p.m., Mingwei Zhang wrote:
>>> So this requires vcpu->pmu to carry two pieces of state information:
>>> 1) a flag similar to TIF_NEED_FPU_LOAD; 2) host perf context info
>>> (phase #1: just a boolean; phase #2: a bitmap of occupied counters).
>>>
>>> This is a non-trivial optimization of the PMU context switch. I am
>>> thinking about splitting it into the following phases:
>>>
>>> 1) lazy PMU context switch, i.e., wait until the guest touches a PMU
>>> MSR for the 1st time.
>>> 2) fast PMU context switch on the KVM side, i.e., KVM checks the event
>>> selector values (enabled/disabled) and selectively switches PMU state
>>> (reducing rd/wr MSR accesses).
>>> 3) dynamic PMU context switch boundary, i.e., KVM can dynamically
>>> choose the PMU context switch boundary depending on existing active
>>> host-level events.
>>> 3.1) more accurate dynamic PMU context switch, i.e., KVM checks the
>>> host-level counter positions and further reduces the number of MSR
>>> accesses.
>>> 4) guest PMU context preemption, i.e., any new host-level perf
>>> profiling can immediately preempt the guest PMU in the vcpu loop
>>> (instead of waiting for the next PMU context switch in KVM).

>> I'm not quite sure about 4.
>> The new host-level perf event must be an exclude_guest event. It should
>> not be scheduled while a guest is using the PMU. Why do we want to
>> preempt the guest PMU? The current implementation in perf doesn't
>> schedule any exclude_guest events while a guest is running.

> right. The grey area is the code within the KVM_RUN loop, but
> _outside_ of the guest. This part of the code is on the "host" side.
> However, for efficiency reasons, KVM defers the PMU context switch by
> retaining the guest PMU MSR values within the loop.

I assume you mean the optimization of moving the context switch from the
VM-exit/entry boundary to the vCPU boundary.

> Optimization 4
> allows the host side to immediately profile this part instead of
> waiting for the vcpu to reach the PMU context switch locations. Doing
> so will generate more accurate results.

If so, I think 4 is a must-have. Otherwise, it wouldn't honor the
definition of exclude_guest. Without 4, it leaves some random blind
spots, right?

>
> Do we want to preempt that? I think it depends. For regular cloud
> usage, we don't. But for any other usages where we want to prioritize
> KVM/VMM profiling over guest vPMU, it is useful.
>
> My current opinion is that optimization 4 is something nice to have.
> But we should allow people to turn it off, just like we can choose to
> disable kernel preemption.

exclude_guest means everything but the guest. I don't see a reason why
people would want to turn it off and get some random blind spots.

Thanks,
Kan
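
For reference, a minimal sketch of the per-vCPU state and the phase-1
lazy switch discussed above. This is illustrative only, not taken from
any posted patch; every name below is hypothetical except
TIF_NEED_FPU_LOAD, the existing kernel flag used as an analogy.

/*
 * Hypothetical sketch of the state Mingwei describes: a deferred-load
 * flag analogous to TIF_NEED_FPU_LOAD, plus host perf context info
 * (phase #1: a boolean; phase #2: a bitmap of occupied counters).
 */
#include <linux/types.h>

struct vcpu_pmu_switch_state {
	bool guest_state_loaded;        /* lazy-load flag (cf. TIF_NEED_FPU_LOAD) */
	bool host_ctx_in_use;           /* phase #1: any active host perf events? */
	unsigned long host_counter_map; /* phase #2: counters occupied by the host */
};

/*
 * Phase 1 (lazy switch): defer loading guest PMU MSRs until the guest
 * actually touches a PMU MSR for the first time after entering KVM_RUN.
 */
static void pmu_lazy_load_guest(struct vcpu_pmu_switch_state *s)
{
	if (s->guest_state_loaded)
		return;
	/* saving host counters / restoring guest counters would go here */
	s->guest_state_loaded = true;
}

The occupied-counter bitmap is what would let phases 2 and 3.1 skip the
rd/wr of MSRs for counters neither the host nor the guest is using.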