Re: [PATCH bpf-next v8 0/4] bpf: add cpu cycles kfuncs

Vadim Fedorenko <vadim.fedorenko@xxxxxxxxx> · Thu, 28 Nov 2024 14:28:55 +0000

On 28/11/2024 11:27, Peter Zijlstra wrote:
On Tue, Nov 26, 2024 at 10:12:57AM -0800, Andrii Nakryiko wrote:
On Fri, Nov 22, 2024 at 3:34 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

On Wed, Nov 20, 2024 at 04:08:10PM -0800, Vadim Fedorenko wrote:
This patchset adds 2 kfuncs to provide a way to precisely measure the
time spent running some code. The first patch provides a way to get cpu
cycles counter which is used to feed CLOCK_MONOTONIC_RAW. On x86
architecture it is effectively rdtsc_ordered() function while on other
architectures it falls back to __arch_get_hw_counter(). The second patch
adds a kfunc to convert cpu cycles to nanoseconds using shift/mult
constants discovered by kernel. The main use-case for this kfunc is to
convert deltas of timestamp counter values into nanoseconds. It is not
supposed to get CLOCK_MONOTONIC_RAW values as offset part is skipped.
The JIT version is done for x86 for now; on other architectures it falls
back to a slightly simplified version of vdso_calc_ns.
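
For readers unfamiliar with the shift/mult scheme mentioned above, here is
a minimal sketch of how a counter delta can be scaled to nanoseconds. It is
not the code from the patchset: the struct and function names
(cyc2ns_params, mult_from_hz, cycles_delta_to_ns) are hypothetical, and the
128-bit intermediate relies on the GCC/Clang unsigned __int128 extension.
Only the delta is scaled and no base offset is added, which is why the
result is a duration rather than a CLOCK_MONOTONIC_RAW value.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical container for the scaling constants. In the kernel these
 * are discovered per clocksource; here they are derived from a frequency. */
struct cyc2ns_params {
	uint32_t mult;
	uint32_t shift;
};

/* Derive mult so that ns ~= (cycles * mult) >> shift for a counter
 * running at freq_hz. Rounds to nearest. Assumes shift is small enough
 * that NSEC_PER_SEC << shift fits in 64 bits (shift <= 34). */
static uint32_t mult_from_hz(uint64_t freq_hz, uint32_t shift)
{
	uint64_t tmp = (NSEC_PER_SEC << shift) + freq_hz / 2;

	return (uint32_t)(tmp / freq_hz);
}

/* Scale a counter delta to nanoseconds. The 128-bit product avoids
 * overflow for large deltas (compiler extension, not ISO C). */
static uint64_t cycles_delta_to_ns(uint64_t delta, struct cyc2ns_params p)
{
	return (uint64_t)(((unsigned __int128)delta * p.mult) >> p.shift);
}
```

As an example, for an assumed 2 GHz counter with shift = 24, mult_from_hz()
yields 1 << 23, so a delta of 3000 counter ticks scales to 1500 ns.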

So, having now read this, I'm still left wondering why you would want to
do this.

Is this just debug stuff, for when you're doing a poor man's profile
run? If it is, why do we care about all the precision or the ns? And why
aren't you using perf?

No, it's not debug stuff. It's meant to be used in production for
measuring durations of whatever is needed. Like uprobe entry/exit
duration, or time between scheduling switches, etc.

Vadim emphasizes benchmarking at scale, but that's a bit misleading.
It's not "benchmarking", it's measuring durations of relevant pairs of
events. In production and at scale, so the unnecessary overhead all
adds up. We'd like to have the minimal possible overhead for this time
passage measurement. And some durations are very brief,

You might want to consider leaving out the LFENCE before the RDTSC on
some of those, LFENCE isn't exactly cheap.

I was considering this option. Unfortunately, RDTSC without LFENCE may
be executed well out of order by the CPU and can easily add noise.
We have seen LFENCE being quite expensive on high core count machines
and we will continue to monitor it. I might add another
helper in the future if the situation becomes unacceptable.

so precision
matters as well. And given this is meant to be later used to do
aggregation and comparison across large swaths of production hosts, we
have to have comparable units, which is why nanoseconds and not some
abstract "time cycles".

Does this address your concerns?

Well, it's clearly useful for you guys, but I do worry about it. Even on
servers DVFS is starting to play a significant role. And the TSC is
unaffected by it.

Directly comparing these numbers, esp. across different systems makes no
sense to me. Yes putting them all in [ns] allows for comparison, but
you're still comparing fundamentally different things.

How does it make sense to measure uprobe entry/exit in wall-clock when
it can vary by at least a factor of 2 depending on DVFS? How does it
make sense to compare an x86-64 uprobe entry/exit to an aaargh64 one?

I'm going to implement JIT for aarch64 soon, and measuring wall-time can
bring more info about platform differences.

Or are you trying to estimate the fraction of overhead spent on
instrumentation instead of real work? Like, this machine spends 5% of
its wall-time in instrumentation, which is effectively not doing work?

The part I'm missing is how using wall-time for these things makes
sense.

I mean, if all you're doing is saying, hey, we appear to be spending X
on this action on this particular system Y doing workload Z (irrespective
of you then having like a million Ys) and this patch reduces X by half
given the same Y and Z. So patch must be awesome.

This is one of the use-cases. Another one is to show differences across
platforms, and that is what usually needs ns.

Then you don't need the conversion to [ns], and the DVFS angle is more
or less mitigated by the whole 'same workload' thing.

We are thinking of this option too, but another point is that it's
usually easier for people to understand nanoseconds rather than
counter values.