Re: [PATCH v2 bpf-next 4/4] selftest/bpf/benchs: add bpf_loop benchmark

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Wed, 24 Nov 2021 11:26:15 -0800

On Wed, Nov 24, 2021 at 4:56 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>
> Joanne Koong <joannekoong@xxxxxx> writes:
>
> > On 11/23/21 11:19 AM, Toke Høiland-Jørgensen wrote:
> >
> >> Joanne Koong <joannekoong@xxxxxx> writes:
> >>
> >>> Add benchmark to measure the throughput and latency of the bpf_loop
> >>> call.
> >>>
> >>> Testing this on qemu on my dev machine on 1 thread, the data is
> >>> as follows:
> >>>
> >>>          nr_loops: 1
> >>> bpf_loop - throughput: 43.350 ± 0.864 M ops/s, latency: 23.068 ns/op
> >>>
> >>>          nr_loops: 10
> >>> bpf_loop - throughput: 69.586 ± 1.722 M ops/s, latency: 14.371 ns/op
> >>>
> >>>          nr_loops: 100
> >>> bpf_loop - throughput: 72.046 ± 1.352 M ops/s, latency: 13.880 ns/op
> >>>
> >>>          nr_loops: 500
> >>> bpf_loop - throughput: 71.677 ± 1.316 M ops/s, latency: 13.951 ns/op
> >>>
> >>>          nr_loops: 1000
> >>> bpf_loop - throughput: 69.435 ± 1.219 M ops/s, latency: 14.402 ns/op
> >>>
> >>>          nr_loops: 5000
> >>> bpf_loop - throughput: 72.624 ± 1.162 M ops/s, latency: 13.770 ns/op
> >>>
> >>>          nr_loops: 10000
> >>> bpf_loop - throughput: 75.417 ± 1.446 M ops/s, latency: 13.260 ns/op
> >>>
> >>>          nr_loops: 50000
> >>> bpf_loop - throughput: 77.400 ± 2.214 M ops/s, latency: 12.920 ns/op
> >>>
> >>>          nr_loops: 100000
> >>> bpf_loop - throughput: 78.636 ± 2.107 M ops/s, latency: 12.717 ns/op
> >>>
> >>>          nr_loops: 500000
> >>> bpf_loop - throughput: 76.909 ± 2.035 M ops/s, latency: 13.002 ns/op
> >>>
> >>>          nr_loops: 1000000
> >>> bpf_loop - throughput: 77.636 ± 1.748 M ops/s, latency: 12.881 ns/op
> >>>
> >>>  From this data, we can see that the latency per loop decreases as the
> >>> number of loops increases. On this particular machine, each loop had an
> >>> overhead of about ~13 ns, and we were able to run ~70 million loops
> >>> per second.
> >> The latency figures are great, thanks! I assume these numbers are with
> >> retpolines enabled? Otherwise 12ns seems a bit much... Or is this
> >> because of qemu?
> > I just tested it on a machine (without retpoline enabled) that runs on
> > actual
> > hardware and here is what I found:
> >
> >              nr_loops: 1
> >      bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
> >
> >              nr_loops: 10
> >      bpf_loop - throughput: 198.519 ± 0.155 M ops/s, latency: 5.037 ns/op
> >
> >              nr_loops: 100
> >      bpf_loop - throughput: 247.448 ± 0.305 M ops/s, latency: 4.041 ns/op
> >
> >              nr_loops: 500
> >      bpf_loop - throughput: 260.839 ± 0.380 M ops/s, latency: 3.834 ns/op
> >
> >              nr_loops: 1000
> >      bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op
> >
> >              nr_loops: 5000
> >      bpf_loop - throughput: 264.211 ± 1.508 M ops/s, latency: 3.785 ns/op
> >
> >              nr_loops: 10000
> >      bpf_loop - throughput: 265.366 ± 3.054 M ops/s, latency: 3.768 ns/op
> >
> >              nr_loops: 50000
> >      bpf_loop - throughput: 235.986 ± 20.205 M ops/s, latency: 4.238 ns/op
> >
> >              nr_loops: 100000
> >      bpf_loop - throughput: 264.482 ± 0.279 M ops/s, latency: 3.781 ns/op
> >
> >              nr_loops: 500000
> >      bpf_loop - throughput: 309.773 ± 87.713 M ops/s, latency: 3.228 ns/op
> >
> >              nr_loops: 1000000
> >      bpf_loop - throughput: 262.818 ± 4.143 M ops/s, latency: 3.805 ns/op
> >
> > The latency is about ~4ns / loop.
> >
> > I will update the commit message in v3 with these new numbers as well.
>
> Right, awesome, thank you for the additional test. This is closer to
> what I would expect: on the hardware I'm usually testing on, a function
> call takes ~1.5ns, but the difference might just be the hardware, or
> because these are indirect calls.
>
> Another comparison just occurred to me (but it's totally OK if you don't
> want to add any more benchmarks):
>
> The difference between a program that does:
>
> bpf_loop(nr_loops, empty_callback, NULL, 0);
>
> and
>
> for (i = 0; i < nr_loops; i++)
>   empty_callback();

You are basically trying to measure the overhead of bpf_loop() helper
call itself, because other than that it should be identical. We can
estimate that already from the numbers Joanne posted above:

             nr_loops: 1
      bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
             nr_loops: 1000
      bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op

nr_loops:1 is bpf_loop() overhead and one static callback call.
bpf_loop()'s own overhead will be in the ballpark of 21.4 - 3.8 =
17.6ns. I don't think we need yet another benchmark just for this.

>
> should show the difference between the indirect call in the helper and a
> direct call from BPF (and show what the potential performance gain from
> having the verifier inline the helper would be). This was more
> interesting when there was a ~10x delta than a ~2x between your numbers
> and mine, so also totally OK to leave this as-is, and we can cycle back
> to such optimisations if it turns out to be necessary...
>
> -Toke
>