Joanne Koong <joannekoong@xxxxxx> writes:

> On 11/24/21 1:59 PM, Toke Høiland-Jørgensen wrote:
>
>> Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:
>>
>>> On Wed, Nov 24, 2021 at 4:56 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>>>> Joanne Koong <joannekoong@xxxxxx> writes:
>>>>
>>>>> On 11/23/21 11:19 AM, Toke Høiland-Jørgensen wrote:
>>>>>
>>>>>> Joanne Koong <joannekoong@xxxxxx> writes:
>>>>>>
>>>>>>> Add benchmark to measure the throughput and latency of the bpf_loop
>>>>>>> call.
>>>>>>>
>>>>>>> Testing this on qemu on my dev machine on 1 thread, the data is
>>>>>>> as follows:
>>>>>>>
>>>>>>> nr_loops: 1
>>>>>>> bpf_loop - throughput: 43.350 ± 0.864 M ops/s, latency: 23.068 ns/op
>>>>>>>
>>>>>>> nr_loops: 10
>>>>>>> bpf_loop - throughput: 69.586 ± 1.722 M ops/s, latency: 14.371 ns/op
>>>>>>>
>>>>>>> nr_loops: 100
>>>>>>> bpf_loop - throughput: 72.046 ± 1.352 M ops/s, latency: 13.880 ns/op
>>>>>>>
>>>>>>> nr_loops: 500
>>>>>>> bpf_loop - throughput: 71.677 ± 1.316 M ops/s, latency: 13.951 ns/op
>>>>>>>
>>>>>>> nr_loops: 1000
>>>>>>> bpf_loop - throughput: 69.435 ± 1.219 M ops/s, latency: 14.402 ns/op
>>>>>>>
>>>>>>> nr_loops: 5000
>>>>>>> bpf_loop - throughput: 72.624 ± 1.162 M ops/s, latency: 13.770 ns/op
>>>>>>>
>>>>>>> nr_loops: 10000
>>>>>>> bpf_loop - throughput: 75.417 ± 1.446 M ops/s, latency: 13.260 ns/op
>>>>>>>
>>>>>>> nr_loops: 50000
>>>>>>> bpf_loop - throughput: 77.400 ± 2.214 M ops/s, latency: 12.920 ns/op
>>>>>>>
>>>>>>> nr_loops: 100000
>>>>>>> bpf_loop - throughput: 78.636 ± 2.107 M ops/s, latency: 12.717 ns/op
>>>>>>>
>>>>>>> nr_loops: 500000
>>>>>>> bpf_loop - throughput: 76.909 ± 2.035 M ops/s, latency: 13.002 ns/op
>>>>>>>
>>>>>>> nr_loops: 1000000
>>>>>>> bpf_loop - throughput: 77.636 ± 1.748 M ops/s, latency: 12.881 ns/op
>>>>>>>
>>>>>>> From this data, we can see that the latency per loop decreases as the
>>>>>>> number of loops increases. On this particular machine, each loop had an
>>>>>>> overhead of about ~13 ns, and we were able to run ~70 million loops
>>>>>>> per second.
>>>>>> The latency figures are great, thanks! I assume these numbers are with
>>>>>> retpolines enabled? Otherwise 12ns seems a bit much... Or is this
>>>>>> because of qemu?
>>>>> I just tested it on a machine (without retpoline enabled) that runs on
>>>>> actual hardware and here is what I found:
>>>>>
>>>>> nr_loops: 1
>>>>> bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
>>>>>
>>>>> nr_loops: 10
>>>>> bpf_loop - throughput: 198.519 ± 0.155 M ops/s, latency: 5.037 ns/op
>>>>>
>>>>> nr_loops: 100
>>>>> bpf_loop - throughput: 247.448 ± 0.305 M ops/s, latency: 4.041 ns/op
>>>>>
>>>>> nr_loops: 500
>>>>> bpf_loop - throughput: 260.839 ± 0.380 M ops/s, latency: 3.834 ns/op
>>>>>
>>>>> nr_loops: 1000
>>>>> bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op
>>>>>
>>>>> nr_loops: 5000
>>>>> bpf_loop - throughput: 264.211 ± 1.508 M ops/s, latency: 3.785 ns/op
>>>>>
>>>>> nr_loops: 10000
>>>>> bpf_loop - throughput: 265.366 ± 3.054 M ops/s, latency: 3.768 ns/op
>>>>>
>>>>> nr_loops: 50000
>>>>> bpf_loop - throughput: 235.986 ± 20.205 M ops/s, latency: 4.238 ns/op
>>>>>
>>>>> nr_loops: 100000
>>>>> bpf_loop - throughput: 264.482 ± 0.279 M ops/s, latency: 3.781 ns/op
>>>>>
>>>>> nr_loops: 500000
>>>>> bpf_loop - throughput: 309.773 ± 87.713 M ops/s, latency: 3.228 ns/op
>>>>>
>>>>> nr_loops: 1000000
>>>>> bpf_loop - throughput: 262.818 ± 4.143 M ops/s, latency: 3.805 ns/op
>>>>>
>>>>> The latency is about ~4ns / loop.
>>>>>
>>>>> I will update the commit message in v3 with these new numbers as well.
>>>> Right, awesome, thank you for the additional test. This is closer to
>>>> what I would expect: on the hardware I'm usually testing on, a function
>>>> call takes ~1.5ns, but the difference might just be the hardware, or
>>>> because these are indirect calls.
>>>>
>>>> Another comparison just occurred to me (but it's totally OK if you don't
>>>> want to add any more benchmarks):
>>>>
>>>> The difference between a program that does:
>>>>
>>>> bpf_loop(nr_loops, empty_callback, NULL, 0);
>>>>
>>>> and
>>>>
>>>> for (i = 0; i < nr_loops; i++)
>>>>         empty_callback();
>>> You are basically trying to measure the overhead of bpf_loop() helper
>>> call itself, because other than that it should be identical.
>> No, I'm trying to measure the difference between the indirect call in
>> the helper, and the direct call from the BPF program. Should be minor
>> without retpolines, and somewhat higher where they are enabled...
>>
>>> We can estimate that already from the numbers Joanne posted above:
>>>
>>> nr_loops: 1
>>> bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
>>> nr_loops: 1000
>>> bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op
>>>
>>> nr_loops:1 is bpf_loop() overhead and one static callback call.
>>> bpf_loop()'s own overhead will be in the ballpark of 21.4 - 3.8 =
>>> 17.6ns. I don't think we need yet another benchmark just for this.
>> That seems really high, though? The helper is a pretty simple function,
>> and the call to it should just be JIT'ed into a single regular function
>> call, right? So why the order-of-magnitude difference?
> I think the overhead of triggering the bpf program from the userspace
> benchmarking program is also contributing to this. When nr_loops = 1, we
> have to do the context switch between userspace + kernel per every 1000
> loops; this overhead also contributes to the latency numbers above.

Right okay. But then that data point is not really measuring what it's
purporting to measure? That's a bit misleading, so maybe better to leave
it out entirely?

-Toke
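
For reference, a minimal sketch of the comparison Toke describes might look
like the program below: the same empty callback driven once through
bpf_loop() (indirect call inside the helper) and once through an open-coded
loop (direct BPF-to-BPF call). This is not the benchmark program from this
series; the attach point, section names, the hits counter and the fixed
NR_LOOPS bound are illustrative assumptions.

// SPDX-License-Identifier: GPL-2.0
/* Sketch only: compares an indirect callback call via bpf_loop() against a
 * direct call to the same static subprog from an open-coded loop.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* compile-time constant so the verifier can bound the open-coded loop */
#define NR_LOOPS 1000

long hits = 0;

static __attribute__((noinline)) int empty_callback(__u32 index, void *data)
{
	return 0; /* returning 0 tells bpf_loop() to keep iterating */
}

SEC("fentry/__x64_sys_getpgid")
int helper_loop(void *ctx)
{
	/* per-iteration call happens inside the bpf_loop() helper */
	bpf_loop(NR_LOOPS, empty_callback, NULL, 0);
	__sync_add_and_fetch(&hits, NR_LOOPS);
	return 0;
}

SEC("fentry/__x64_sys_getpgid")
int direct_loop(void *ctx)
{
	int i;

	/* per-iteration call is a direct call to the static subprog */
	for (i = 0; i < NR_LOOPS; i++)
		empty_callback(i, NULL);
	__sync_add_and_fetch(&hits, NR_LOOPS);
	return 0;
}

char _license[] SEC("license") = "GPL";

Attaching one of the two programs at a time and reading hits from user space
would, in principle, isolate the cost of the indirect call inside the helper
versus the direct call, modulo the user-space trigger overhead Joanne points
out, which dominates when the per-trigger loop count is small.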