Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

Sean Christopherson <seanjc@xxxxxxxxxx> · Mon, 29 Apr 2024 10:44:53 -0700

On Sat, Apr 27, 2024, Mingwei Zhang wrote:
> On Sat, Apr 27, 2024 at 5:59 PM Mi, Dapeng <dapeng1.mi@xxxxxxxxxxxxxxx> wrote:
> >
> >
> > On 4/27/2024 11:04 AM, Mingwei Zhang wrote:
> > > On Fri, Apr 26, 2024 at 12:46 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > >> On Fri, Apr 26, 2024, Kan Liang wrote:
> > >>>> Optimization 4
> > >>>> allows the host side to immediately profiling this part instead of
> > >>>> waiting for vcpu to reach to PMU context switch locations. Doing so
> > >>>> will generate more accurate results.
> > >>> If so, I think the 4 is a must to have. Otherwise, it wouldn't honer the
> > >>> definition of the exclude_guest. Without 4, it brings some random blind
> > >>> spots, right?
> > >> +1, I view it as a hard requirement.  It's not an optimization, it's about
> > >> accuracy and functional correctness.
> > > Well. Does it have to be a _hard_ requirement? no?

Assuming I understand how perf_event_open() works, which may be a fairly big
assumption, for me, yes, this is a hard requirement.

> > > The irq handler triggered by "perf record -a" could just inject a
> > > "state". Instead of immediately preempting the guest PMU context, perf
> > > subsystem could allow KVM defer the context switch when it reaches the
> > > next PMU context switch location.

FWIW, forcefully interrupting the guest isn't a hard requirement, but practically
speaking I think that will yield the simplest, most robust implementation.

> > > This is the same as the preemption kernel logic. Do you want me to
> > > stop the work immediately? Yes (if you enable preemption), or No, let
> > > me finish my job and get to the scheduling point.

Not really.  Task scheduling is by its nature completely exclusive, i.e. it's
not possible to concurrently run multiple tasks on a single logical CPU.  Given
a single CPU, to run task B, task A _must_ be scheduled out.

That is not the case here.  Profiling the host with exclude_guest=1 isn't mutually
exclusive with the guest using the PMU.  There's obviously the additional overhead
of switching PMU context, but the two uses themselves are not mutually exclusive.

And more importantly, perf_event_open() already has well-established ABI where it
can install events across CPUs.  And when perf_event_open() returns, userspace can
rely on the event being active and counting (assuming it wasn't disabled by default).

> > > Implementing this might be more difficult to debug. That's my real
> > > concern. If we do not enable preemption, the PMU context switch will
> > > only happen at the 2 pairs of locations. If we enable preemption, it
> > > could happen at any time.

Yes and no.  I agree that swapping guest/host state from IRQ context could lead
to hard to debug issues, but NOT doing so could also lead to hard to debug issues.
And even worse, those issues would likely be unique to specific kernel and/or
system configurations.

E.g. userspace creates an event, but sometimes it randomly doesn't count correctly.
Is the problem simply that it took a while for KVM to get to a scheduling point,
or is there a race lurking?  And what happens if the vCPU is the only runnable task
on its pCPU, i.e. never gets scheduled out?

Mix in all of the possible preemption and scheduler models, and other sources of
forced rescheduling, e.g. RCU, and the number of factors to account for becomes
quite terrifying.

> > IMO I don't prefer to add a switch to enable/disable the preemption. I
> > think current implementation is already complicated enough and
> > unnecessary to introduce an new parameter to confuse users. Furthermore,
> > the switch could introduce an uncertainty and may mislead the perf user
> > to read the perf stats incorrectly.

+1000.

> > As for debug, it won't bring any difference as long as no host event is created.
> >
> That's ok. It is about opinions and brainstorming. Adding a parameter
> to disable preemption is from the cloud usage perspective. The
> conflict of opinions is which one you prioritize: guest PMU or the
> host PMU? If you stand on the guest vPMU usage perspective, do you
> want anyone on the host to shoot a profiling command and generate
> turbulence? no. If you stand on the host PMU perspective and you want
> to profile VMM/KVM, you definitely want accuracy and no delay at all.

Hard no from me.  Attempting to support two fundamentally different models means
twice the maintenance burden.  The *best* case scenario is that usage is roughly
a 50/50 spit.  The worst case scenario is that the majority of users favor one
model over the other, thus resulting in extremely limited tested of the minority
model.

KVM already has this problem with scheduler preemption models, and it's painful.
The overwhelming majority of KVM users run non-preemptible kernels, and so our
test coverage for preemtible kernels is abysmal.

E.g. the TDP MMU effectively had a fatal flaw with preemptible kernels that went
unnoticed for many kernel releases[*], until _another_ bug introduced with dynamic
preemption models resulted in users running code that was supposed to be specific
to preemtible kernels.

[* https://lore.kernel.org/kvm/ef81ff36-64bb-4cfe-ae9b-e3acf47bff24@xxxxxxxxxxx