Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

Mingwei Zhang <mizhang@xxxxxxxxxx> · Wed, 1 May 2024 10:43:58 -0700

On Mon, Apr 29, 2024 at 10:44 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Sat, Apr 27, 2024, Mingwei Zhang wrote:
> > On Sat, Apr 27, 2024 at 5:59 PM Mi, Dapeng <dapeng1.mi@xxxxxxxxxxxxxxx> wrote:
> > >
> > >
> > > On 4/27/2024 11:04 AM, Mingwei Zhang wrote:
> > > > On Fri, Apr 26, 2024 at 12:46 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > >> On Fri, Apr 26, 2024, Kan Liang wrote:
> > > >>>> Optimization 4
> > > >>>> allows the host side to immediately profiling this part instead of
> > > >>>> waiting for vcpu to reach to PMU context switch locations. Doing so
> > > >>>> will generate more accurate results.
> > > >>> If so, I think the 4 is a must to have. Otherwise, it wouldn't honer the
> > > >>> definition of the exclude_guest. Without 4, it brings some random blind
> > > >>> spots, right?
> > > >> +1, I view it as a hard requirement.  It's not an optimization, it's about
> > > >> accuracy and functional correctness.
> > > > Well. Does it have to be a _hard_ requirement? no?
>
> Assuming I understand how perf_event_open() works, which may be a fairly big
> assumption, for me, yes, this is a hard requirement.
>
> > > > The irq handler triggered by "perf record -a" could just inject a
> > > > "state". Instead of immediately preempting the guest PMU context, perf
> > > > subsystem could allow KVM defer the context switch when it reaches the
> > > > next PMU context switch location.
>
> FWIW, forcefully interrupting the guest isn't a hard requirement, but practically
> speaking I think that will yield the simplest, most robust implementation.
>
> > > > This is the same as the preemption kernel logic. Do you want me to
> > > > stop the work immediately? Yes (if you enable preemption), or No, let
> > > > me finish my job and get to the scheduling point.
>
> Not really.  Task scheduling is by its nature completely exclusive, i.e. it's
> not possible to concurrently run multiple tasks on a single logical CPU.  Given
> a single CPU, to run task B, task A _must_ be scheduled out.
>
> That is not the case here.  Profiling the host with exclude_guest=1 isn't mutually
> exclusive with the guest using the PMU.  There's obviously the additional overhead
> of switching PMU context, but the two uses themselves are not mutually exclusive.
>
> And more importantly, perf_event_open() already has well-established ABI where it
> can install events across CPUs.  And when perf_event_open() returns, userspace can
> rely on the event being active and counting (assuming it wasn't disabled by default).
>
> > > > Implementing this might be more difficult to debug. That's my real
> > > > concern. If we do not enable preemption, the PMU context switch will
> > > > only happen at the 2 pairs of locations. If we enable preemption, it
> > > > could happen at any time.
>
> Yes and no.  I agree that swapping guest/host state from IRQ context could lead
> to hard to debug issues, but NOT doing so could also lead to hard to debug issues.
> And even worse, those issues would likely be unique to specific kernel and/or
> system configurations.
>
> E.g. userspace creates an event, but sometimes it randomly doesn't count correctly.
> Is the problem simply that it took a while for KVM to get to a scheduling point,
> or is there a race lurking?  And what happens if the vCPU is the only runnable task
> on its pCPU, i.e. never gets scheduled out?
>
> Mix in all of the possible preemption and scheduler models, and other sources of
> forced rescheduling, e.g. RCU, and the number of factors to account for becomes
> quite terrifying.
>
> > > IMO I don't prefer to add a switch to enable/disable the preemption. I
> > > think current implementation is already complicated enough and
> > > unnecessary to introduce an new parameter to confuse users. Furthermore,
> > > the switch could introduce an uncertainty and may mislead the perf user
> > > to read the perf stats incorrectly.
>
> +1000.
>
> > > As for debug, it won't bring any difference as long as no host event is created.
> > >
> > That's ok. It is about opinions and brainstorming. Adding a parameter
> > to disable preemption is from the cloud usage perspective. The
> > conflict of opinions is which one you prioritize: guest PMU or the
> > host PMU? If you stand on the guest vPMU usage perspective, do you
> > want anyone on the host to shoot a profiling command and generate
> > turbulence? no. If you stand on the host PMU perspective and you want
> > to profile VMM/KVM, you definitely want accuracy and no delay at all.
>
> Hard no from me.  Attempting to support two fundamentally different models means
> twice the maintenance burden.  The *best* case scenario is that usage is roughly
> a 50/50 spit.  The worst case scenario is that the majority of users favor one
> model over the other, thus resulting in extremely limited tested of the minority
> model.
>
> KVM already has this problem with scheduler preemption models, and it's painful.
> The overwhelming majority of KVM users run non-preemptible kernels, and so our
> test coverage for preemtible kernels is abysmal.
>
> E.g. the TDP MMU effectively had a fatal flaw with preemptible kernels that went
> unnoticed for many kernel releases[*], until _another_ bug introduced with dynamic
> preemption models resulted in users running code that was supposed to be specific
> to preemtible kernels.
>
> [* https://lore.kernel.org/kvm/ef81ff36-64bb-4cfe-ae9b-e3acf47bff24@xxxxxxxxxxx
>

I hear your voice, Sean.

In our cloud, we have a host-level profiling going on for all cores
periodically. It will be profiling X seconds every Y minute. Having
the host-level profiling using exclude_guest is fine, but stopping the
host-level profiling is a no no. Tweaking the X and Y is theoretically
possible, but highly likely out of the scope of virtualization. Now,
some of the VMs might be actively using vPMU at the same time. How can
we properly ensure the guest vPMU has consistent performance? Instead
of letting the VM suffer from the high overhead of PMU for X seconds
of every Y minute?

Any thought/help is appreciated. I see the logic of having preemption
there for correctness of the profiling on the host level. Doing this,
however, negatively impacts the above business usage.

One of the things on top of the mind is that: there seems to be no way
for the perf subsystem to express this: "no, your host-level profiling
is not interested in profiling the KVM_RUN loop when our guest vPMU is
actively running".

Thanks.
-Mingwei

-Mingwei