Re: [PATCH bpf-next v1 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack

On 7/15/24 6:35 PM, Alexei Starovoitov wrote:
On Mon, Jul 15, 2024 at 6:17 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
With 4096 loop iterations per program run, I got
   $ perf record -- ./bench -w3 -d10 -a --nr-batch-iters=4096 no-private-stack
     27.89%  bench    [kernel.vmlinux]                  [k] htab_map_hash
     21.55%  bench    [kernel.vmlinux]                  [k] _raw_spin_lock
     11.51%  bench    [kernel.vmlinux]                  [k] htab_map_delete_elem
     10.26%  bench    [kernel.vmlinux]                  [k] htab_map_update_elem
      4.85%  bench    [kernel.vmlinux]                  [k] __pcpu_freelist_push
      4.34%  bench    [kernel.vmlinux]                  [k] alloc_htab_elem
      3.50%  bench    [kernel.vmlinux]                  [k] memcpy_orig
      3.22%  bench    [kernel.vmlinux]                  [k] __pcpu_freelist_pop
      2.68%  bench    [kernel.vmlinux]                  [k] bcmp
      2.52%  bench    [kernel.vmlinux]                  [k] __htab_map_lookup_elem

so the prog itself is not even in the top 10, which means
the test doesn't measure anything meaningful about the private
stack itself.
It just benchmarks the hash map, and the overhead of the extra push/pop is invisible.

+SEC("tp/syscalls/sys_enter_getpgid")
+int stack0(void *ctx)
+{
+       struct data_t key = {}, value = {};
+       struct data_t *pvalue;
+       int i;
+
+       hits++;
+       key.d[10] = 5;
+       value.d[8] = 10;
+
+       for (i = 0; i < batch_iters; i++) {
+               pvalue = bpf_map_lookup_elem(&htab, &key);
+               if (!pvalue)
+                       bpf_map_update_elem(&htab, &key, &value, 0);
+               bpf_map_delete_elem(&htab, &key);
+       }
Instead of calling helpers that do a lot of work, the test should
call global subprograms or noinline static functions that are nops.
Only then might we see the overhead of push/pop r9.
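A minimal sketch of that idea (illustrative only, not from the patch: it assumes the same `hits`/`batch_iters` globals and vmlinux.h/bpf_helpers.h includes as the benchmark prog, and the function names are made up):

```c
/* Hypothetical sketch: the loop body calls a noinline nop subprogram
 * instead of the map helpers, so each iteration measures mostly the
 * call itself plus any private-stack push/pop emitted around it.
 */
__noinline int nop_subprog(int i)
{
	/* Empty asm with a register constraint keeps the compiler from
	 * folding the call away despite the trivial body.
	 */
	asm volatile ("" : "+r"(i));
	return i;
}

SEC("tp/syscalls/sys_enter_getpgid")
int stack0_nop(void *ctx)
{
	int i;

	hits++;
	for (i = 0; i < batch_iters; i++)
		nop_subprog(i);
	return 0;
}
```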

Once you do that you'll see that the
+SEC("tp/syscalls/sys_enter_getpgid")
approach has too much overhead
(you don't see it right now since the hashmap dominates).
Please use the approach I mentioned earlier: fentry into
a helper, with another prog calling that helper in a for() loop.
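A hedged sketch of that fentry approach (the attach point, prog names, and reuse of the `hits`/`batch_iters` globals are assumptions for illustration; bpf_get_numa_node_id() is picked only as an example of a near-nop helper):

```c
/* Hypothetical sketch: attach fentry to a near-nop helper implementation
 * and drive it from a trigger prog calling that helper in a tight loop.
 * With an empty fentry body, the remaining per-call cost is dominated by
 * the trampoline plus any private-stack push/pop the JIT emits.
 */
SEC("fentry/bpf_get_numa_node_id")
int BPF_PROG(stack_fentry)
{
	hits++;
	return 0;
}

SEC("tp/syscalls/sys_enter_getpgid")
int stack_trigger(void *ctx)
{
	int i;

	/* Each helper call enters the fentry prog above. */
	for (i = 0; i < batch_iters; i++)
		bpf_get_numa_node_id();
	return 0;
}
```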

Thanks for the suggestion. I will use an fentry program with empty
functions to test the worst-case (maximum) overhead.


pw-bot: cr



