On Wed, Nov 24, 2021 at 4:56 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>
> Joanne Koong <joannekoong@xxxxxx> writes:
>
> > On 11/23/21 11:19 AM, Toke Høiland-Jørgensen wrote:
> >
> >> Joanne Koong <joannekoong@xxxxxx> writes:
> >>
> >>> Add benchmark to measure the throughput and latency of the bpf_loop
> >>> call.
> >>>
> >>> Testing this on qemu on my dev machine on 1 thread, the data is
> >>> as follows:
> >>>
> >>> nr_loops: 1
> >>> bpf_loop - throughput: 43.350 ± 0.864 M ops/s, latency: 23.068 ns/op
> >>>
> >>> nr_loops: 10
> >>> bpf_loop - throughput: 69.586 ± 1.722 M ops/s, latency: 14.371 ns/op
> >>>
> >>> nr_loops: 100
> >>> bpf_loop - throughput: 72.046 ± 1.352 M ops/s, latency: 13.880 ns/op
> >>>
> >>> nr_loops: 500
> >>> bpf_loop - throughput: 71.677 ± 1.316 M ops/s, latency: 13.951 ns/op
> >>>
> >>> nr_loops: 1000
> >>> bpf_loop - throughput: 69.435 ± 1.219 M ops/s, latency: 14.402 ns/op
> >>>
> >>> nr_loops: 5000
> >>> bpf_loop - throughput: 72.624 ± 1.162 M ops/s, latency: 13.770 ns/op
> >>>
> >>> nr_loops: 10000
> >>> bpf_loop - throughput: 75.417 ± 1.446 M ops/s, latency: 13.260 ns/op
> >>>
> >>> nr_loops: 50000
> >>> bpf_loop - throughput: 77.400 ± 2.214 M ops/s, latency: 12.920 ns/op
> >>>
> >>> nr_loops: 100000
> >>> bpf_loop - throughput: 78.636 ± 2.107 M ops/s, latency: 12.717 ns/op
> >>>
> >>> nr_loops: 500000
> >>> bpf_loop - throughput: 76.909 ± 2.035 M ops/s, latency: 13.002 ns/op
> >>>
> >>> nr_loops: 1000000
> >>> bpf_loop - throughput: 77.636 ± 1.748 M ops/s, latency: 12.881 ns/op
> >>>
> >>> From this data, we can see that the latency per loop decreases as the
> >>> number of loops increases. On this particular machine, each loop had an
> >>> overhead of about ~13 ns, and we were able to run ~70 million loops
> >>> per second.
> >> The latency figures are great, thanks! I assume these numbers are with
> >> retpolines enabled? Otherwise 12ns seems a bit much... Or is this
> >> because of qemu?
> > I just tested it on a machine (without retpoline enabled) that runs on
> > actual hardware and here is what I found:
> >
> > nr_loops: 1
> > bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
> >
> > nr_loops: 10
> > bpf_loop - throughput: 198.519 ± 0.155 M ops/s, latency: 5.037 ns/op
> >
> > nr_loops: 100
> > bpf_loop - throughput: 247.448 ± 0.305 M ops/s, latency: 4.041 ns/op
> >
> > nr_loops: 500
> > bpf_loop - throughput: 260.839 ± 0.380 M ops/s, latency: 3.834 ns/op
> >
> > nr_loops: 1000
> > bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op
> >
> > nr_loops: 5000
> > bpf_loop - throughput: 264.211 ± 1.508 M ops/s, latency: 3.785 ns/op
> >
> > nr_loops: 10000
> > bpf_loop - throughput: 265.366 ± 3.054 M ops/s, latency: 3.768 ns/op
> >
> > nr_loops: 50000
> > bpf_loop - throughput: 235.986 ± 20.205 M ops/s, latency: 4.238 ns/op
> >
> > nr_loops: 100000
> > bpf_loop - throughput: 264.482 ± 0.279 M ops/s, latency: 3.781 ns/op
> >
> > nr_loops: 500000
> > bpf_loop - throughput: 309.773 ± 87.713 M ops/s, latency: 3.228 ns/op
> >
> > nr_loops: 1000000
> > bpf_loop - throughput: 262.818 ± 4.143 M ops/s, latency: 3.805 ns/op
> >
> > The latency is about ~4ns / loop.
> >
> > I will update the commit message in v3 with these new numbers as well.
>
> Right, awesome, thank you for the additional test. This is closer to
> what I would expect: on the hardware I'm usually testing on, a function
> call takes ~1.5ns, but the difference might just be the hardware, or
> because these are indirect calls.
>
> Another comparison just occurred to me (but it's totally OK if you don't
> want to add any more benchmarks):
>
> The difference between a program that does:
>
> bpf_loop(nr_loops, empty_callback, NULL, 0);
>
> and
>
> for (i = 0; i < nr_loops; i++)
>     empty_callback();

You are basically trying to measure the overhead of the bpf_loop() helper
call itself, because other than that the two should be identical.
We can estimate that already from the numbers Joanne posted above:

nr_loops: 1
bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op

nr_loops: 1000
bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op

nr_loops:1 is bpf_loop() overhead and one static callback call.
bpf_loop()'s own overhead will be in the ballpark of 21.4 - 3.8 = 17.6ns.
I don't think we need yet another benchmark just for this.

>
> should show the difference between the indirect call in the helper and a
> direct call from BPF (and show what the potential performance gain from
> having the verifier inline the helper would be). This was more
> interesting when there was a ~10x delta than a ~2x between your numbers
> and mine, so also totally OK to leave this as-is, and we can cycle back
> to such optimisations if it turns out to be necessary...
>
> -Toke
>