On Mon, Jul 15, 2024 at 6:17 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
>
> With 4096 loop iterations per program run, I got
> $ perf record -- ./bench -w3 -d10 -a --nr-batch-iters=4096 no-private-stack
>   27.89%  bench  [kernel.vmlinux]  [k] htab_map_hash
>   21.55%  bench  [kernel.vmlinux]  [k] _raw_spin_lock
>   11.51%  bench  [kernel.vmlinux]  [k] htab_map_delete_elem
>   10.26%  bench  [kernel.vmlinux]  [k] htab_map_update_elem
>    4.85%  bench  [kernel.vmlinux]  [k] __pcpu_freelist_push
>    4.34%  bench  [kernel.vmlinux]  [k] alloc_htab_elem
>    3.50%  bench  [kernel.vmlinux]  [k] memcpy_orig
>    3.22%  bench  [kernel.vmlinux]  [k] __pcpu_freelist_pop
>    2.68%  bench  [kernel.vmlinux]  [k] bcmp
>    2.52%  bench  [kernel.vmlinux]  [k] __htab_map_lookup_elem

so the prog itself is not even in the top 10, which means that the test
doesn't measure anything meaningful about the private stack itself.
It just benchmarks the hash map, and the overhead of the extra push/pop
is invisible.

> +SEC("tp/syscalls/sys_enter_getpgid")
> +int stack0(void *ctx)
> +{
> +	struct data_t key = {}, value = {};
> +	struct data_t *pvalue;
> +	int i;
> +
> +	hits++;
> +	key.d[10] = 5;
> +	value.d[8] = 10;
> +
> +	for (i = 0; i < batch_iters; i++) {
> +		pvalue = bpf_map_lookup_elem(&htab, &key);
> +		if (!pvalue)
> +			bpf_map_update_elem(&htab, &key, &value, 0);
> +		bpf_map_delete_elem(&htab, &key);
> +	}

Instead of calling helpers that do a lot of work, the test should call
global subprograms or noinline static functions that are nops.
Only then might we see the overhead of push/pop of r9.

Once you do that you'll see that the +SEC("tp/syscalls/sys_enter_getpgid")
approach has too much overhead (you don't see it right now, since the
hashmap dominates). Please use the approach I mentioned earlier:
fentry into a helper, with another prog calling that helper in a for() loop.

pw-bot: cr
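
[Editor's note: a minimal sketch of the suggested shape, in the style of the
existing bench_trigger programs. The choice of bpf_get_numa_node_id() as the
fentry target, the prog names and the nop subprog are illustrative assumptions,
not taken from this thread.]

// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

long hits;
int batch_iters;

/* nop-like global subprogram; calling it from the measured prog is what
 * would exercise the extra push/pop of r9 with a private stack
 */
__noinline int stack_subprog(int x)
{
	return x;
}

/* the prog being measured: fentry on the cheap helper the driver calls,
 * so its own body stays nearly empty and push/pop cost is not drowned out
 */
SEC("fentry/bpf_get_numa_node_id")
int BPF_PROG(stack_fentry)
{
	hits++;
	return stack_subprog(0);
}

/* driver prog: calls the cheap helper batch_iters times per run, which
 * triggers the fentry prog above on every iteration
 */
SEC("tp/syscalls/sys_enter_getpgid")
int stack_driver(void *ctx)
{
	int i;

	for (i = 0; i < batch_iters; i++)
		bpf_get_numa_node_id();
	return 0;
}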