On Fri, Mar 15, 2024 at 9:31 AM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Fri, Mar 15, 2024 at 9:03 AM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Thu, Mar 14, 2024 at 10:18 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> > >
> > > Existing kprobe/fentry triggering benchmarks have a 1-to-1 mapping between
> > > one syscall execution and one BPF program run. While we use a fast
> > > get_pgid() syscall, syscall overhead can still be non-trivial.
> > >
> > > This patch adds a set of kprobe/fentry benchmarks that significantly
> > > amortizes the cost of the syscall vs the actual BPF triggering overhead.
> > > We do this by employing the BPF_PROG_TEST_RUN command to trigger a
> > > "driver" raw_tp program which does a tight parameterized loop calling a
> > > cheap BPF helper (bpf_get_smp_processor_id()), to which kprobe/fentry
> > > programs are attached for benchmarking.
> > >
> > > This way 1 bpf() syscall causes N executions of the BPF program being
> > > benchmarked. N defaults to 100, but can be adjusted with the
> > > --trig-batch-iters CLI argument.
> > >
> > > Results speak for themselves:
> > >
> > > $ ./run_bench_trigger.sh
> > > uprobe-base : 138.054 ± 0.556M/s
> > > base        :  16.650 ± 0.123M/s
> >
> > What's going on here? Why are the two bases so different?
> > I thought "base" is what all other benchmarks
> > should be compared against,
> > i.e., "base" is the theoretical maximum for all benchs.
> > Or should uprobe* benches be compared with uprobe-base
> > while all other benchs are compared with 'base'?
> > Probably not anymore due to this new approach.
> > 'base' has kinda lost its value then.
>
> naming is hard. base is doing syscall(get_pgid) in a tight loop. It's
> the baseline for the previous trigger benchmarks, where we used a syscall
> to trigger kprobe/fentry programs. uprobe-base is just a user-space loop
> that does atomic_inc() in a tight loop.
> So uprobe-base is basically
> a measure of how fast the CPU is, but it's unrealistic to expect either
> fentry or kprobe to get close to it, and it's especially unrealistic to
> expect uprobes to get close to it.
>
> Naming suggestions are welcome, though.
>
> I'm not sure what the "base" should be for the xxx-fast benchmarks.
> Doing a counter loop in the driver BPF program, perhaps? Would you like
> me to add a base-fast benchmark doing just that?

How about this:

  base -> base-syscall (i.e., "syscall-calling baseline")
  uprobe-base -> base-user-loop (i.e., "user space-only baseline")

and then for the "fast" baseline we add "base-kernel-loop" as a
"kernel-side looping baseline".

Or we can use some naming based on "counting": base-syscall-count,
base-user-count, base-kernel-count?

Any preferences, anyone?

> > > tp                  : 11.068 ± 0.100M/s
> > > rawtp               : 14.087 ± 0.511M/s
> > > kprobe              :  9.641 ± 0.027M/s
> > > kprobe-multi        : 10.263 ± 0.061M/s
> > > kretprobe           :  5.475 ± 0.028M/s
> > > kretprobe-multi     :  5.703 ± 0.036M/s
> > > fentry              : 14.544 ± 0.112M/s
> > > fexit               : 10.637 ± 0.073M/s
> > > fmodret             : 11.357 ± 0.061M/s
> > > kprobe-fast         : 14.286 ± 0.377M/s
> > > kprobe-multi-fast   : 14.999 ± 0.204M/s
> > > kretprobe-fast      :  7.646 ± 0.084M/s
> > > kretprobe-multi-fast:  4.354 ± 0.066M/s
> > > fentry-fast         : 31.475 ± 0.254M/s
> > > fexit-fast          : 17.379 ± 0.195M/s
> >
> > I think the "-fast" suffix doesn't really fit here.
> > It's a different way to benchmark fexit vs kprobe overhead.
> >
> > I think the old approach should be replaced with the new one.
> > There is no value in keeping the old stuff around now that
> > it's clear it's measuring much more than it should.
>
> I'd say we should keep them. Note that fmod_ret, tp, and raw_tp can't
> be benchmarked the same way, so we don't have -fast variants for them.
> If we get rid of the pre-existing fentry/kprobe benchmarks, we won't be
> able to compare raw_tp/fmod_ret/fexit.s to them.
>
> And to be clear, in practice I'm not comparing xxx-fast with xxx at
> all. I just put all the numbers in this patch summary to show that we
> indeed lose significant performance to the syscall.
>
> In practice, when comparing kprobe vs fentry, it should be kprobe-fast
> vs fentry-fast.
>
> > > One huge and not yet explained deviation is a slowdown of
> > > kretprobe-multi; we should look into that separately.
> > >
> > > kretprobe           : 5.475 ± 0.028M/s
> > > kretprobe-multi     : 5.703 ± 0.036M/s
> > > kretprobe-fast      : 7.646 ± 0.084M/s
> > > kretprobe-multi-fast: 4.354 ± 0.066M/s
> >
> > That is weird indeed. Since it's a single-CPU run,
> > it cannot be due to contention.
> > My gut feel is that something odd is here:
> >
> >   #ifdef CONFIG_X86_KERNEL_IBT
> >   static unsigned long get_entry_ip(unsigned long fentry_ip)
> >
> > Did you have it in your config?
>
> No, this was without IBT. I do have an upcoming small patch to
> eliminate an unnecessary copy_kernel_nofault() in the IBT path, but I
> don't think this is it. It should be pretty clear with profiling where
> we are losing CPU; I just didn't have the time/desire to look into it,
> as I have tons of small things being worked on in parallel and didn't
> want to add another one.
>
> > > +SEC("raw_tp")
> > > +int trigger_driver(void *ctx)
> > > +{
> > > +	int i;
> > > +
> > > +	for (i = 0; i < batch_iters; i++)
> > > +		(void)bpf_get_smp_processor_id(); /* attach here to benchmark */
> >
> > bpf_get_numa_node_id() is probably even faster and
> > not subject to DEBUG configs.
>
> Sure, I can switch to that, np.