On Fri, Mar 15, 2024 at 9:03 AM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Thu, Mar 14, 2024 at 10:18 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> >
> > Existing kprobe/fentry triggering benchmarks have a 1-to-1 mapping between
> > one syscall execution and one BPF program run. While we use the fast
> > get_pgid() syscall, syscall overhead can still be non-trivial.
> >
> > This patch adds a set of kprobe/fentry benchmarks that significantly
> > amortize the cost of the syscall relative to the actual BPF triggering
> > overhead. We do this by employing the BPF_PROG_TEST_RUN command to
> > trigger a "driver" raw_tp program which does a tight parameterized
> > loop calling a cheap BPF helper (bpf_get_smp_processor_id()), to which
> > the kprobe/fentry programs are attached for benchmarking.
> >
> > This way, one bpf() syscall causes N executions of the BPF program
> > being benchmarked. N defaults to 100, but can be adjusted with the
> > --trig-batch-iters CLI argument.
> >
> > Results speak for themselves:
> >
> > $ ./run_bench_trigger.sh
> > uprobe-base         : 138.054 ± 0.556M/s
> > base                :  16.650 ± 0.123M/s
>
> What's going on here? Why are the two bases so different?
> I thought the "base" is what all other benchmarks
> should be compared against.
> The "base" is the theoretical maximum for all benchs.
> Or should uprobe* benches be compared with uprobe-base
> while all other benchs are compared with 'base'?
> Probably not anymore due to this new approach.
> The 'base' has kinda lost its value then.

Naming is hard. base does syscall(get_pgid) in a tight loop. It's the
"base" relative to the previous trigger benchmarks, where we used a
syscall to trigger kprobe/fentry programs.

uprobe-base is just a user-space loop doing atomic_inc() in a tight
loop. So uprobe-base is basically a measure of how fast the CPU is,
but it's unrealistic to expect fentry/kprobe to get close to it, and
even more unrealistic for uprobes. Naming suggestions are welcome,
though.

I'm not sure what the "base" should be for the xxx-fast benchmarks. A
counter loop in the driver BPF program, perhaps? Would you like me to
add a base-fast benchmark doing just that?

> > tp                  :  11.068 ± 0.100M/s
> > rawtp               :  14.087 ± 0.511M/s
> > kprobe              :   9.641 ± 0.027M/s
> > kprobe-multi        :  10.263 ± 0.061M/s
> > kretprobe           :   5.475 ± 0.028M/s
> > kretprobe-multi     :   5.703 ± 0.036M/s
> > fentry              :  14.544 ± 0.112M/s
> > fexit               :  10.637 ± 0.073M/s
> > fmodret             :  11.357 ± 0.061M/s
> > kprobe-fast         :  14.286 ± 0.377M/s
> > kprobe-multi-fast   :  14.999 ± 0.204M/s
> > kretprobe-fast      :   7.646 ± 0.084M/s
> > kretprobe-multi-fast:   4.354 ± 0.066M/s
> > fentry-fast         :  31.475 ± 0.254M/s
> > fexit-fast          :  17.379 ± 0.195M/s
>
> I think the "-fast" suffix doesn't really fit here.
> It's a different way to benchmark fexit vs kprobe overhead.
>
> I think the old approach should be replaced with the new one.
> There is no value in keeping the old stuff around when
> it's now clear that it's measuring much more than it should.

I'd say we should keep them. Note that fmod_ret, tp, and raw_tp can't
be benchmarked the same way, so we don't have -fast variants for them.
If we got rid of the pre-existing fentry/kprobe benchmarks, we
wouldn't be able to compare raw_tp/fmod_ret/fexit.s to them.

And to be clear, in practice I'm not comparing xxx-fast with xxx at
all. I only put all the numbers in this patch summary to show that we
do indeed lose significant performance to syscall overhead. In
practice, when comparing kprobe vs fentry, it should be kprobe-fast vs
fentry-fast.
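For context, user space drives each batch with a single
BPF_PROG_TEST_RUN call on the raw_tp driver program. A minimal sketch
of that side (not the exact bench code; trigger_driver and batch_iters
come from this patch, while trigger_batch and the trigger_bench
skeleton name are stand-ins for illustration):

  #include <stdio.h>
  #include <bpf/libbpf.h>
  #include "trigger_bench.skel.h" /* hypothetical skeleton header */

  static void trigger_batch(struct trigger_bench *skel)
  {
          /* one bpf() syscall runs trigger_driver once; its internal
           * loop then fires the attached kprobe/fentry program
           * batch_iters times
           */
          LIBBPF_OPTS(bpf_test_run_opts, opts);
          int err;

          err = bpf_prog_test_run_opts(
                  bpf_program__fd(skel->progs.trigger_driver), &opts);
          if (err)
                  fprintf(stderr, "test_run failed: %d\n", err);
  }

So the syscall cost is paid once per batch_iters program runs instead
of once per run.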
>
> > One huge and not yet explained deviation is a slowdown of
> > kretprobe-multi; we should look into that separately.
> >
> > kretprobe           :   5.475 ± 0.028M/s
> > kretprobe-multi     :   5.703 ± 0.036M/s
> > kretprobe-fast      :   7.646 ± 0.084M/s
> > kretprobe-multi-fast:   4.354 ± 0.066M/s
>
> That is weird indeed. Since it's a single-CPU run,
> it cannot be due to contention.
> My gut feel is that something odd is here:
>
> #ifdef CONFIG_X86_KERNEL_IBT
> static unsigned long get_entry_ip(unsigned long fentry_ip)
>
> Did you have it in your config?

No, this was without IBT. I do have an upcoming small patch to
eliminate an unnecessary copy_kernel_nofault() in the IBT path, but I
don't think this is it. Profiling should make it pretty clear where we
are losing CPU; I just didn't have the time (or desire) to look into
it, as I have tons of small things being worked on in parallel and
didn't want to add another one.

>
> > +SEC("raw_tp")
> > +int trigger_driver(void *ctx)
> > +{
> > +        int i;
> > +
> > +        for (i = 0; i < batch_iters; i++)
> > +                (void)bpf_get_smp_processor_id(); /* attach here to benchmark */
>
> bpf_get_numa_node_id() is probably even faster and
> not subject to DEBUG configs.

Sure, I can switch to that, np.
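I.e., the driver would become something like this (a sketch, not the
final patch; it assumes batch_iters stays a global set from user space
before load, as in the current version):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  /* set from user space via --trig-batch-iters; default mirrors the patch */
  const volatile int batch_iters = 100;

  SEC("raw_tp")
  int trigger_driver(void *ctx)
  {
          int i;

          /* numa_node_id() is a plain per-CPU read, so unlike
           * smp_processor_id() it isn't inflated by
           * CONFIG_DEBUG_PREEMPT-style debug checks
           */
          for (i = 0; i < batch_iters; i++)
                  (void)bpf_get_numa_node_id(); /* attach here to benchmark */

          return 0;
  }

  char _license[] SEC("license") = "GPL";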