Re: [RFC PATCH 00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs

Jim Mattson <jmattson@xxxxxxxxxx> · Thu, 5 Dec 2024 05:48:33 -0800

On Thu, Dec 5, 2024 at 1:00 AM Nikunj A Dadhania <nikunj@xxxxxxx> wrote:
>
> On 11/22/2024 12:22 AM, Mingwei Zhang wrote:
> > Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick
> > (250 Hz by default) to measure their effective CPU frequency. To avoid
> > the overhead of intercepting these frequent MSR reads, allow the guest
> > to read them directly by loading guest values into the hardware MSRs.
> >
> > These MSRs are continuously running counters whose values must be
> > carefully tracked during all vCPU state transitions:
> > - Guest IA32_APERF advances only during guest execution
> > - Guest IA32_MPERF advances at the TSC frequency whenever the vCPU is
> >   in C0 state, even when not actively running
>
> Any particular reason to treat APERF and MPERF differently?

Core cycles accumulated by the logical processor that do not
contribute to the execution of the virtual processor should not be
counted. For example, consider Google Cloud's e2-small VM type, which
is capped at a 25% duty cycle. Even if the logical processor is
humming along at an effective frequency of 3.6 GHz, an e2-small vCPU
task is only resident 25% of the time, so its effective frequency is
more like 0.9 GHz (over a sufficiently large period of time).
Similarly, if a logical processor running at 3.6 GHz is shared 50/50
by two vCPUs, the effective frequency of each is about 1.8 GHz (again,
over a sufficiently large period of time). Over smaller time periods,
the effective frequencies in these examples would look like square
waves, alternating between 3.6 GHz and 0, much like thermal
throttling. And, much like thermal throttling, MPERF reference cycles
continue to tick on at the fixed reference frequency, even when APERF
cycles drop to 0.

> AFAIU, APERF and MPERF architecturally will count when the CPU is in C0 state.
> MPERF counting at constant frequency and the APERF counting at a variable
> frequency. Shouldn't we treat APERF and MPERF equal and keep on counting in C0
> state and even when "not actively running" ?
>
> Can you clarify what do you mean by "not actively running"?

The current implementation considers the vCPU to be actively running
if the task is in the KVM_RUN ioctl, between vcpu_load() and
vcpu_put(). This also implies that the task itself is currently
running on a logical processor, since there is a vcpu_put() on
sched_out and a vcpu_load() on sched_in. As Sean points out, this is
only an approximation, since (a) such things as I/O completion in
userspace are not counted, and (b) such things as uncompressing a
zswapped page that happen in the vCPU task are counted.

> Regards
> Nikunj
>