On Wed, May 01, 2024, Mingwei Zhang wrote:
> On Mon, Apr 29, 2024 at 10:44 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Sat, Apr 27, 2024, Mingwei Zhang wrote:
> > > That's ok. It is about opinions and brainstorming. Adding a parameter
> > > to disable preemption is from the cloud usage perspective. The
> > > conflict of opinions is which one you prioritize: guest PMU or the
> > > host PMU? If you stand on the guest vPMU usage perspective, do you
> > > want anyone on the host to shoot a profiling command and generate
> > > turbulence? No. If you stand on the host PMU perspective and you want
> > > to profile VMM/KVM, you definitely want accuracy and no delay at all.
> >
> > Hard no from me. Attempting to support two fundamentally different models means
> > twice the maintenance burden. The *best* case scenario is that usage is roughly
> > a 50/50 split. The worst case scenario is that the majority of users favor one
> > model over the other, thus resulting in extremely limited testing of the minority
> > model.
> >
> > KVM already has this problem with scheduler preemption models, and it's painful.
> > The overwhelming majority of KVM users run non-preemptible kernels, and so our
> > test coverage for preemptible kernels is abysmal.
> >
> > E.g. the TDP MMU effectively had a fatal flaw with preemptible kernels that went
> > unnoticed for many kernel releases[*], until _another_ bug introduced with dynamic
> > preemption models resulted in users running code that was supposed to be specific
> > to preemptible kernels.
> >
> > [* https://lore.kernel.org/kvm/ef81ff36-64bb-4cfe-ae9b-e3acf47bff24@xxxxxxxxxxx
> >
>
> I hear your voice, Sean.
>
> In our cloud, we have host-level profiling going on for all cores periodically. It profiles for X seconds every Y minutes. Having the host-level profiling use exclude_guest is fine, but stopping the host-level profiling is a no-no. Tweaking X and Y is theoretically possible, but almost certainly outside the scope of virtualization. Now, some of the VMs might be actively using the vPMU at the same time. How can we properly ensure the guest vPMU has consistent performance, instead of letting the VM suffer the high overhead of the PMU for X seconds of every Y minutes?
>
> Any thoughts/help are appreciated. I see the logic of having preemption there for the correctness of profiling at the host level. Doing so, however, negatively impacts the business usage described above.
>
> One of the things top of mind is that there seems to be no way for the perf subsystem to express this: "no, your host-level profiling is not interested in profiling the KVM_RUN loop when our guest vPMU is actively running".

For good reason, IMO. The KVM_RUN loop can reach _far_ outside of KVM,
especially when IRQs and NMIs are involved. I don't think anyone can
reasonably say that profiling is never interested in what happens while a
task is in KVM_RUN. E.g. if there's a bottleneck in some memory allocation
flow that happens to be triggered in the greater KVM_RUN loop, that's
something we'd want to show up in our profiling data.

And if our systems are properly configured, then for VMs with a
mediated/passthrough PMU, 99.99999% of their associated pCPUs' time should
be spent in KVM_RUN. If that's our reality, what's the point of profiling
if KVM_RUN is out of scope?
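Note that exclude_guest already gives the host-level profiler that split:
samples taken while the CPU is in guest mode are dropped, but everything on
the host side of the KVM_RUN loop stays visible. For reference, a minimal
sketch of such a host-side sampling event via perf_event_open(); the event
type/config, sampling period, and single-CPU handling below are illustrative
placeholders, not the actual collector described above:

  #include <linux/perf_event.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <sys/types.h>
  #include <unistd.h>

  static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                              int cpu, int group_fd, unsigned long flags)
  {
          return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
  }

  int main(void)
  {
          struct perf_event_attr attr;
          long fd;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          attr.type = PERF_TYPE_HARDWARE;
          attr.config = PERF_COUNT_HW_CPU_CYCLES;
          attr.sample_period = 1000000;           /* arbitrary period */
          attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
          attr.exclude_guest = 1;                 /* drop guest-mode samples */
          attr.disabled = 1;

          /* System-wide on CPU 0 only; a real profiler opens one event per CPU. */
          fd = perf_event_open(&attr, -1, 0, -1, 0);
          if (fd < 0) {
                  perror("perf_event_open");
                  return 1;
          }

          /* ... mmap the ring buffer, PERF_EVENT_IOC_ENABLE, read samples ... */
          close(fd);
          return 0;
  }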
We could make the context switching logic more sophisticated, e.g. trigger
a context switch when control leaves KVM, a la the ASI concepts, but that's
all but guaranteed to be overkill, and would have a very high maintenance
cost.

But we can likely get what we want (low observed overhead from the guest)
while still context switching PMU state in vcpu_enter_guest(). KVM already
handles the hottest VM-Exit reasons in its fastpath, i.e. without
triggering a PMU context switch. For a variety of reasons, I think we
should be more aggressive and handle more VM-Exits in the fastpath, e.g. I
can't think of any reason KVM can't handle fast page faults in the
fastpath.

If we handle the overwhelming majority of VM-Exits in the fastpath once the
guest is booted, i.e. when vCPUs aren't taking a high number of "slow"
VM-Exits, then the fact that slow VM-Exits trigger a PMU context switch
should be a non-issue, because taking a slow exit would be a rare
operation.

I.e. rather than solving the overhead problem by moving around the context
switch logic, solve the problem by moving KVM code inside the "guest PMU"
section. It's essentially a different way of doing the same thing, with
the critical difference being that only hand-selected flows are excluded
from profiling, i.e. only the flows that need to be blazing fast and should
be uninteresting from a profiling perspective are excluded.
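To make the shape of that concrete, a rough sketch of where the PMU context
switch would sit relative to the fastpath in the inner run loop. The
kvm_mediated_pmu_{load,put}(), vendor_vcpu_run(), and kvm_handle_slow_exit()
helpers are placeholder names for illustration, not existing KVM functions;
only the EXIT_FASTPATH_REENTER_GUEST plumbing mirrors what KVM has today:

  /*
   * Sketch only, not actual KVM code.  The point is which exits stay inside
   * the "guest PMU" section: fastpath exits re-enter the guest with the
   * guest PMU still loaded, so only "slow" exits pay for the swap and
   * become visible to host profiling.
   */
  static int vcpu_run_sketch(struct kvm_vcpu *vcpu)
  {
          fastpath_t exit_fastpath;

          for (;;) {
                  /* Host perf context is switched out here; guest owns the PMU. */
                  kvm_mediated_pmu_load(vcpu);                    /* placeholder */

                  for (;;) {
                          /* Stand-in for the vendor hook, e.g. vmx_vcpu_run(). */
                          exit_fastpath = vendor_vcpu_run(vcpu);

                          /*
                           * Exits handled entirely in the fastpath (TSC deadline
                           * writes, reschedule IPIs, ideally fast page faults,
                           * etc.) never leave the "guest PMU" section.
                           */
                          if (exit_fastpath != EXIT_FASTPATH_REENTER_GUEST)
                                  break;
                  }

                  /* Only a slow exit swaps the PMU state back ... */
                  kvm_mediated_pmu_put(vcpu);                     /* placeholder */

                  /* ... so everything below is visible to host profiling. */
                  if (!kvm_handle_slow_exit(vcpu, exit_fastpath)) /* placeholder */
                          break;
          }
          return 0;
  }

Where exactly the load/put calls land in vcpu_enter_guest() is of course up
for grabs; the sketch is only about which exits stay inside the excluded
section.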