On Mon, Jul 15, 2024 at 6:17 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
>
> With 4096 loop iterations per program run, I got
> $ perf record -- ./bench -w3 -d10 -a --nr-batch-iters=4096 no-private-stack
>   27.89%  bench  [kernel.vmlinux]  [k] htab_map_hash
>   21.55%  bench  [kernel.vmlinux]  [k] _raw_spin_lock
>   11.51%  bench  [kernel.vmlinux]  [k] htab_map_delete_elem
>   10.26%  bench  [kernel.vmlinux]  [k] htab_map_update_elem
>    4.85%  bench  [kernel.vmlinux]  [k] __pcpu_freelist_push
>    4.34%  bench  [kernel.vmlinux]  [k] alloc_htab_elem
>    3.50%  bench  [kernel.vmlinux]  [k] memcpy_orig
>    3.22%  bench  [kernel.vmlinux]  [k] __pcpu_freelist_pop
>    2.68%  bench  [kernel.vmlinux]  [k] bcmp
>    2.52%  bench  [kernel.vmlinux]  [k] __htab_map_lookup_elem

so the prog itself is not even in the top 10, which means that the test
doesn't measure anything meaningful about the private stack itself.
It just benchmarks the hash map, and the overhead of the extra push/pop
is invisible.

> +SEC("tp/syscalls/sys_enter_getpgid")
> +int stack0(void *ctx)
> +{
> +	struct data_t key = {}, value = {};
> +	struct data_t *pvalue;
> +	int i;
> +
> +	hits++;
> +	key.d[10] = 5;
> +	value.d[8] = 10;
> +
> +	for (i = 0; i < batch_iters; i++) {
> +		pvalue = bpf_map_lookup_elem(&htab, &key);
> +		if (!pvalue)
> +			bpf_map_update_elem(&htab, &key, &value, 0);
> +		bpf_map_delete_elem(&htab, &key);
> +	}

Instead of calling helpers that do a lot of work, the test should call
global subprograms or noinline static functions that are nops.
Only then might we see the overhead of push/pop of r9.

Once you do that you'll see that the +SEC("tp/syscalls/sys_enter_getpgid")
approach has too much overhead (you don't see it right now, since the
hashmap dominates). Please use the approach I mentioned earlier:
fentry into a helper, with another prog calling that helper in a for() loop.

pw-bot: cr
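
[Editor's note: a minimal sketch of the suggested shape, in the style of the
existing bench_trigger programs. The choice of bpf_get_numa_node_id() as the
fentry target, the prog names and the nop subprog are illustrative assumptions,
not taken from this thread.]

// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

long hits;
int batch_iters;

/* nop-like global subprogram; calling it from the measured prog is what
 * would exercise the extra push/pop of r9 with a private stack
 */
__noinline int stack_subprog(int x)
{
	return x;
}

/* the prog being measured: fentry on the cheap helper the driver calls,
 * so its own body stays nearly empty and push/pop cost is not drowned out
 */
SEC("fentry/bpf_get_numa_node_id")
int BPF_PROG(stack_fentry)
{
	hits++;
	return stack_subprog(0);
}

/* driver prog: calls the cheap helper batch_iters times per run, which
 * triggers the fentry prog above on every iteration
 */
SEC("tp/syscalls/sys_enter_getpgid")
int stack_driver(void *ctx)
{
	int i;

	for (i = 0; i < batch_iters; i++)
		bpf_get_numa_node_id();
	return 0;
}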