Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

Mingwei Zhang <mizhang@xxxxxxxxxx> · Thu, 18 Apr 2024 21:41:01 +0000

On Thu, Apr 18, 2024, Mingwei Zhang wrote:
> On Mon, Apr 15, 2024, Sean Christopherson wrote:
> > On Mon, Apr 15, 2024, Mingwei Zhang wrote:
> > > On Mon, Apr 15, 2024 at 3:04 AM Mi, Dapeng <dapeng1.mi@xxxxxxxxxxxxxxx> wrote:
> > > > On 4/15/2024 2:06 PM, Mingwei Zhang wrote:
> > > > > On Fri, Apr 12, 2024 at 9:25 PM Mi, Dapeng <dapeng1.mi@xxxxxxxxxxxxxxx> wrote:
> > > > >>>> It's necessary to clear the EVENTSELx MSRs for both GP and fixed counters.
> > > > >>>> Considering this case, Guest uses GP counter 2, but Host doesn't use it. So
> > > > >>>> if the EVENTSEL2 MSR is not cleared here, the GP counter 2 would be enabled
> > > > >>>> unexpectedly on host later since Host perf always enable all validate bits
> > > > >>>> in PERF_GLOBAL_CTRL MSR. That would cause issues.
> > > > >>>>
> > > > >>>> Yeah,  the clearing for PMCx MSR should be unnecessary .
> > > > >>>>
> > > > >>> Why is clearing for PMCx MSR unnecessary? Do we want to leaking counter
> > > > >>> values to the host? NO. Not in cloud usage.
> > > > >> No, this place is clearing the guest counter value instead of host
> > > > >> counter value. Host always has method to see guest value in a normal VM
> > > > >> if he want. I don't see its necessity, it's just a overkill and
> > > > >> introduce extra overhead to write MSRs.
> > > > >>
> > > > > I am curious how the perf subsystem solves the problem? Does perf
> > > > > subsystem in the host only scrubbing the selector but not the counter
> > > > > value when doing the context switch?
> > > >
> > > > When context switch happens, perf code would schedule out the old events
> > > > and schedule in the new events. When scheduling out, the ENABLE bit of
> > > > EVENTSELx MSR would be cleared, and when scheduling in, the EVENTSELx
> > > > and PMCx MSRs would be overwritten with new event's attr.config and
> > > > sample_period separately.  Of course, these is only for the case when
> > > > there are new events to be programmed on the PMC. If no new events, the
> > > > PMCx MSR would keep stall value and won't be cleared.
> > > >
> > > > Anyway, I don't see any reason that PMCx MSR must be cleared.
> > > >
> > > 
> > > I don't have a strong opinion on the upstream version. But since both
> > > the mediated vPMU and perf are clients of PMU HW, leaving PMC values
> > > uncleared when transition out of the vPMU boundary is leaking info
> > > technically.
> > 
> > I'm not objecting to ensuring guest PMCs can't be read by any entity that's not
> > in the guest's TCB, which is what I would consider a true leak.  I'm objecting
> > to blindly clearing all PMCs, and more specifically objecting to *KVM* clearing
> > PMCs when saving guest state without coordinating with perf in any way.
> 
> Agree. blindly clearing PMCs is the basic implementation. I am thinking
> about what coordination between perf and KVM as well.
> 
> > 
> > I am ok if we start with (or default to) a "safe" implementation that zeroes all
> > PMCs when switching to host context, but I want KVM and perf to work together to
> > do the context switches, e.g. so that we don't end up with code where KVM writes
> > to all PMC MSRs and that perf also immediately writes to all PMC MSRs.
> 
> Sure. Point taken.
> > 
> > One my biggest complaints with the current vPMU code is that the roles and
> > responsibilities between KVM and perf are poorly defined, which leads to suboptimal
> > and hard to maintain code.
> 
> Right.
> > 
> > Case in point, I'm pretty sure leaving guest values in PMCs _would_ leak guest
> > state to userspace processes that have RDPMC permissions, as the PMCs might not
> > be dirty from perf's perspective (see perf_clear_dirty_counters()).
> > 
> 
> ah. This is a good point.
> 
> 		switch_mm_irqs_off() =>
> 		cr4_update_pce_mm() =>
> 		/*
> 		 * Clear the existing dirty counters to
> 		 * prevent the leak for an RDPMC task.
> 		 */

FYI, for the elaboration of "an RDPMC task".

when echo 2> /sys/devices/cpu/rdpmc, kernel set CR4.PCE to 1.

Once that is done, rdpmc instruction is no longer a priviledged
instruction. It is allowed for all tasks to execute in userspace.

Thanks.
-Mingwei
> 		perf_clear_dirty_counters()
> 
> So perf does clear dirty counter values on process context switch. This
> is nice to know.
> 
> perf_clear_dirty_counters() clear the counter values according to
> cpuc->dirty except for those assigned counters.
> 
> > Blindly clearing PMCs in KVM "solves" that problem, but in doing so makes the
> > overall code brittle because it's not clear whether KVM _needs_ to clear PMCs,
> > or if KVM is just being paranoid.
> 
> There is a difference between KVM and perf subsystem on PMU context
> switch. The latter has the notion of "perf_events", while the former
> currently does not. It is quite hard for KVM to know which counters are
> really "in use".
> 
> Another point I want to raise up to you is that, KVM PMU context switch
> and Perf PMU context switch happens at different timing:
> 
>  - The former is a context switch between guest/host state of the same
>    process, happening at VM-enter/exit boundary.
>  - The latter is a context switch beteen two host-level processes.
>  - The former happens before the latter.
>  - Current design has no PMC partitioning between host/guest due to
>    arch limitation.
> 
> From the above, I feel that it might be impossible to combine them or to
> add coordination? Unless we do the KVM PMU context switch at vcpu loop
> boundary...
> 
> Thanks.
> -Mingwei