On 2024-04-25 4:16 p.m., Mingwei Zhang wrote:
> On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>
>> On 2024-04-25 12:24 a.m., Mingwei Zhang wrote:
>>> On Wed, Apr 24, 2024 at 8:56 PM Mi, Dapeng <dapeng1.mi@xxxxxxxxxxxxxxx> wrote:
>>>>
>>>> On 4/24/2024 11:00 PM, Sean Christopherson wrote:
>>>>> On Wed, Apr 24, 2024, Dapeng Mi wrote:
>>>>>> On 4/24/2024 1:02 AM, Mingwei Zhang wrote:
>>>>>>>>> Maybe (just maybe) it is possible to do the PMU context switch at
>>>>>>>>> the vcpu boundary normally, but do it at the VM Enter/Exit
>>>>>>>>> boundary when the host is profiling the KVM kernel module. So,
>>>>>>>>> dynamically adjusting the PMU context switch location could be an
>>>>>>>>> option.
>>>>>>>> If two VMs both have the PMU enabled but the host PMU is not in
>>>>>>>> use, the PMU context switch should be done in the vcpu thread
>>>>>>>> sched-out path.
>>>>>>>>
>>>>>>>> If the host PMU is also in use, we can choose whether the PMU
>>>>>>>> switch should be done in the VM-exit path or the vcpu thread
>>>>>>>> sched-out path.
>>>>>>>>
>>>>>>> The host PMU is always enabled, i.e., Linux currently does not
>>>>>>> support the KVM PMU running standalone. I guess what you mean is
>>>>>>> that there are no active perf_events on the host side. Allowing the
>>>>>>> PMU context switch to drift from the vm-enter/exit boundary to the
>>>>>>> vcpu loop boundary based on checking host-side events might be a
>>>>>>> good option. We can keep discussing it, but I won't propose it in
>>>>>>> v2.
>>>>>> I doubt this deferring is really doable. It still makes the host
>>>>>> lose most of its ability to profile KVM. Per my understanding, most
>>>>>> of the KVM overhead happens in the vcpu loop, specifically in
>>>>>> VM-exit handling. We have no idea when the host wants to create a
>>>>>> perf event to profile KVM; it could be at any time.
>>>>> No, the idea is that KVM will load host PMU state asap, but only when
>>>>> host PMU state actually needs to be loaded, i.e. only when there are
>>>>> relevant host events.
>>>>>
>>>>> If there are no host perf events, KVM keeps guest PMU state loaded
>>>>> for the entire KVM_RUN loop, i.e. provides optimal behavior for the
>>>>> guest. But if a host perf event exists (or comes along), KVM context
>>>>> switches the PMU at VM-Enter/VM-Exit, i.e. lets the host profile
>>>>> almost all of KVM, at the cost of a degraded experience for the guest
>>>>> while host perf events are active.
>>>>
>>>> I see. So KVM needs to provide a callback which is called from the
>>>> IPI handler. The KVM callback needs to be called to switch the PMU
>>>> state before perf actually enables the host event and touches the PMU
>>>> MSRs. And only perf events with the exclude_guest attribute are
>>>> allowed to be created on the host. Thanks.
>>>
>>> Do we really need a KVM callback? I think that is one option.
>>>
>>> Immediately after VMEXIT, KVM will check whether there are "host perf
>>> events". If so, do the PMU context switch immediately. Otherwise, keep
>>> deferring the context switch to the end of the vCPU loop.
>>>
>>> Detecting whether there are "host perf events" would be interesting.
>>> The "host perf events" refer to the perf_events on the host that are
>>> active, assigned HW counters, and saved when context switching to the
>>> guest PMU. I think getting those events could be done by fetching the
>>> bitmaps in cpuc.
>>
>> The cpuc is an arch-specific structure. I don't think it can be
>> accessed from generic code. You would have to implement arch-specific
>> functions to fetch the bitmaps. It's probably not worth it.
>>
>> You may check the pinned_groups and flexible_groups to understand
>> whether there are host perf events which may be scheduled at VM-exit.
>> But they will not tell you the counter indices, which are only known
>> once the host event is actually scheduled.
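For illustration, a minimal sketch of the kind of arch hook being
discussed, with all names hypothetical (on x86 the real state lives in
the per-CPU struct cpu_hw_events, which generic code cannot see):

#include <linux/percpu.h>

/*
 * Illustrative sketch only, not existing kernel code: let generic code
 * ask the arch PMU driver which hardware counters currently back
 * active host events.
 */
static DEFINE_PER_CPU(unsigned long, host_active_ctr_mask);

/*
 * The arch PMU code would update this whenever host events are
 * scheduled in or out.
 */
void arch_pmu_set_active_ctr_mask(unsigned long mask)
{
	this_cpu_write(host_active_ctr_mask, mask);
}

/* Generic-side query: bit N set means HW counter N backs a host event. */
unsigned long arch_pmu_get_active_ctr_mask(void)
{
	return this_cpu_read(host_active_ctr_mask);
}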
>>
>>> I have to look into the details. But at the time of VMEXIT, KVM
>>> should already have that information, so it can immediately decide
>>> whether to do the PMU context switch or not.
>>>
>>> Oh, but when control is executing within the run loop and host-level
>>> profiling starts, say 'perf record -a ...', it will generate an IPI
>>> to all CPUs. Maybe that's when we need a callback so the KVM guest
>>> PMU context gets preempted for the host-level profiling. Gah..
>>>
>>> Hmm, not a fan of that. That means the host can poke the guest PMU
>>> context at any time and cause higher overhead. But I admit it is much
>>> better than the current approach.
>>>
>>> The only thing is that any command like 'perf record/stat -a' shot in
>>> dark corners of the host can preempt the guest PMUs of _all_ running
>>> VMs. So, to alleviate that, maybe a module parameter that disables
>>> this "preemption" is possible? That should fit scenarios where we
>>> don't want the guest PMU to be preempted outside of the vCPU loop.
>>>
>>
>> It should not happen. In the current implementation, perf rejects all
>> !exclude_guest system-wide event creation if a guest with the vPMU is
>> running.
>> However, it's possible to create an exclude_guest system-wide event at
>> any time. KVM cannot use information from VM-entry to decide whether
>> there will be active perf events at VM-exit.
>
> Hmm, why not? If there is any exclude_guest system-wide event,
> perf_guest_enter() can return something to tell KVM "hey, some active
> host events are swapped out; they were originally in counters #2 and
> #3". If so, at the time when perf_guest_enter() returns, KVM will ack
> that and keep it in its pmu data structure.

I think it's possible that someone creates a !exclude_guest event after
the perf_guest_enter(). The stale information would then be saved in
KVM. Perf will schedule the event at the next perf_guest_exit(), and
KVM will not know about it.

> Now, when context switching back to the host right at VMEXIT, KVM
> will check this data and see whether the host perf context has
> something active (of course, they are all exclude_guest events). If
> not, defer the context switch to the vcpu boundary. Otherwise, do the
> proper PMU context switch, respecting the occupied counter positions
> on the host side, i.e., avoid doubling the work on the KVM side.
>
> Kan, any suggestion on the above approach?

I think we can only know the accurate event list at perf_guest_exit().
You may check the pinned_groups and flexible_groups, which tell you
whether there are candidate events.

> Totally understand that there might be some difficulty, since the
> perf subsystem works in several layers and obviously fetching the
> low-level mapping is arch-specific work. If that is difficult, we can
> split the work into two phases: 1) phase #1, just ask perf to tell
> KVM whether there are active exclude_guest events swapped out;
> 2) phase #2, ask perf to tell us their (low-level) counter indices.

If you want an accurate counter mask, changes in the arch-specific
code are required. Two phases sound good to me.

Besides the perf changes, I think KVM should also track which counters
need to be saved/restored. That information can be obtained from the
EventSel interception.
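A rough sketch of that tracking, with hypothetical names throughout
(only the ARCH_PERFMON_EVENTSEL_ENABLE bit is a real x86 definition):
KVM would mark a counter as in use when an intercepted guest EventSel
write sets the enable bit, and only counters in the resulting bitmap
would need save/restore on a PMU context switch.

#include <linux/bitops.h>
#include <linux/types.h>
#include <asm/perf_event.h>

#define VPMU_MAX_GP_COUNTERS	8	/* illustrative limit */

struct vpmu_sketch {
	u64		eventsel[VPMU_MAX_GP_COUNTERS];
	unsigned long	in_use;	/* bit i set: counter i is programmed */
};

/* Called from the (hypothetical) EventSel MSR write intercept. */
static void vpmu_track_eventsel_write(struct vpmu_sketch *pmu,
				      unsigned int idx, u64 data)
{
	pmu->eventsel[idx] = data;

	/* A counter only needs save/restore if its enable bit is set. */
	if (data & ARCH_PERFMON_EVENTSEL_ENABLE)
		__set_bit(idx, &pmu->in_use);
	else
		__clear_bit(idx, &pmu->in_use);
}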
Thanks,
Kan

>>
>> The perf_guest_exit() will reload the host state. It's impossible to
>> save the guest state after that. We may need a KVM callback, so that
>> perf can tell KVM whether to save the guest state before perf reloads
>> the host state.
>>
>> Thanks,
>> Kan
>>>>
>>>>> My original sketch: https://lore.kernel.org/all/ZR3eNtP5IVAHeFNC@google.com
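A sketch of the callback shape discussed above, with all names
hypothetical (no such perf interface exists today): perf would invoke
the hook from its perf_guest_exit() path before reloading host PMU
state, giving KVM a chance to save the still-loaded guest state first.

/*
 * Hypothetical interface sketch, not an existing perf API: a hook that
 * a perf_guest_exit()-like path would invoke before reloading host PMU
 * state, so KVM can save the guest PMU state while it is still live.
 */
struct perf_guest_switch_cb {
	/* Called while the hardware PMU still holds guest state. */
	void (*save_guest_pmu)(void);
};

static struct perf_guest_switch_cb *guest_switch_cb;

/* KVM would register its callback at module init (illustrative only). */
void perf_register_guest_switch_cb(struct perf_guest_switch_cb *cb)
{
	guest_switch_cb = cb;
}

/* Inside a perf_guest_exit()-like path, before touching host state: */
static void perf_guest_exit_sketch(void)
{
	if (guest_switch_cb && guest_switch_cb->save_guest_pmu)
		guest_switch_cb->save_guest_pmu();

	/* ... perf now reloads the host PMU state ... */
}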