Re: [RFC v4 1/1] selftests/cpuidle: Add support for cpuidle latency measurement

Pratik Sampat <psampat@xxxxxxxxxxxxx> · Mon, 14 Sep 2020 12:01:09 +0530

On 03/09/20 8:20 pm, Artem Bityutskiy wrote:
On Thu, 2020-09-03 at 17:30 +0530, Pratik Sampat wrote:
I certainly did not know about that the Intel architecture being aware
of timers and pre-wakes the CPUs which makes the timer experiment
observations void.
Well, things depend on platform, it is really "void", it is just
different and it measures an optimized case. The result may be smaller
observed latency. And things depend on the platform.

Of course, this will be for just software observability and hardware
can be more complex with each architecture behaving differently.

However, we are also collecting a baseline measurement wherein we run
the same test on a 100% busy CPU and the measurement of latency from
that could be considered to the kernel-userspace overhead.
The rest of the measurements would be considered keeping this baseline
in mind.
Yes, this should give the idea of the overhead, but still, at least for
many Intel platforms I would not be comfortable using the resulting
number (measured latency - baseline) for a cpuidle driver, because
there are just too many variables there. I am not sure I could assume
the baseline measured this way is an invariant - it could be noticeably
different depending on whether you use C-states or not.

At least on Intel platforms, this will mean that the IPI method won't
cover deep C-states like, say, PC6, because one CPU is busy. Again, not
saying this is not interesting, just pointing out the limitation.
That's a valid point. We have similar deep idle states in POWER too.
The idea here is that this test should be run on an already idle
system, of course there will be kernel jitters along the way
which can cause little skewness in observations across some CPUs but I
believe the observations overall should be stable.
If baseline and cpuidle latency are numbers of same order of magnitude,
and you are measuring in a controlled lab system, may be yes. But if
baseline is, say, in milliseconds, and you are measuring a 10
microseconds C-state, then probably no.

This makes complete sense. The magnitude of deviations being greater
than the scope of the experiment may not be very useful in quantifying
the latency metric.

One way is to minimize the baseline overhead is to make this a kernel
module https://lkml.org/lkml/2020/7/21/567. However, the overhead is
unavoidable but definetly can be further minimized by using an external
approach suggested by you in your LPC talk

Another solution to this could be using isolcpus, but that just
increases the complexity all the more.
If you have any suggestions of any other way that could guarantee
idleness that would be great.
Well, I did not try to guarantee idleness. I just use timers and
external device (the network card), so no CPUs needs to be busy and the
system can enter deep C-states. Then I just look at median, 99%-th
percentile, etc.

But by all means IPI is also a very interesting experiment. Just covers
a different usage scenario.

When I started experimenting in this area, one of my main early
takeaways was realization that C-state latency really depends on the
event source.

That is an interesting observation, on POWER systems where we don't
have timer related wakeup optimizations, the readings from this test do
signify a difference between latencies of IPI versus the latency
gathered after a timer interrupt.

However, these timer based variations weren't as prominent on my Intel
based ThinkPad t480, therefore in confirmation with your observations.

This discussions does help!
Although this approach may not help quantify latency deviations at a
hardware-accurate level but could still be helpful in quantifying this
metric from a software observability point of view.

Thanks!
Pratik