On Tue, Nov 12, 2024 at 1:43 PM Vadim Fedorenko <vadim.fedorenko@xxxxxxxxx> wrote:
>
> On 12/11/2024 05:50, Andrii Nakryiko wrote:
> > On Fri, Nov 8, 2024 at 4:42 PM Vadim Fedorenko <vadfed@xxxxxxxx> wrote:
> >>
> >> New kfunc to return the ARCH-specific timecounter. On x86, the BPF JIT
> >> converts it into an ordered rdtsc call. Other architectures will get
> >> JIT implementations too where supported. The fallback is
> >> __arch_get_hw_counter().
> >>
> >> Signed-off-by: Vadim Fedorenko <vadfed@xxxxxxxx>
> >> ---
> >
> > nit: please add a cover letter for the next revision; multi-patch sets
> > generally should come with a cover letter, unless it's a set of
> > trivial and mostly independent patches. Anyways...
>
> Yeah, sure. This series has grown from the first small version...
>
> >
> > I haven't looked through the code yet, but I was curious to benchmark
> > the perf benefit, so that's what I did for fun this evening.
> >
> > (!!!) BTW, a complete aside, but I think it's interesting. It turned
> > out that using bpf_prog_test_run() scales *VERY POORLY* with a large
> > number of CPUs, because we start spending tons of time in
> > fdget()/fdput(). I initially got pretty unscalable results, profiled
> > a bit, and then switched to just doing syscall(__NR_getpgid) +
> > SEC("raw_tp/sys_enter"). The point is that microbenchmarking is
> > tricky and we need to improve our existing bench setup for some
> > benchmarks. Anyways, getting back to the main topic.
> >
> > I wrote two quick benchmarks testing what I see as the intended use
> > case for these kfuncs (batching amortizes the cost of triggering the
> > BPF program; batch_iters = 500 in my case):
> >
> > SEC("?raw_tp/sys_enter")
> > int trigger_driver_ktime(void *ctx)
> > {
> >         volatile __u64 total = 0;
> >         int i;
> >
> >         for (i = 0; i < batch_iters; i++) {
> >                 __u64 start, end;
> >
> >                 start = bpf_ktime_get_ns();
> >                 end = bpf_ktime_get_ns();
> >                 total += end - start;
> >         }
> >         inc_counter_batch(batch_iters);
> >
> >         return 0;
> > }
> >
> > extern __u64 bpf_get_cpu_cycles(void) __weak __ksym;
> > extern __u64 bpf_cpu_cycles_to_ns(__u64 cycles) __weak __ksym;
> >
> > SEC("?raw_tp/sys_enter")
> > int trigger_driver_cycles(void *ctx)
> > {
> >         volatile __u64 total = 0;
> >         int i;
> >
> >         for (i = 0; i < batch_iters; i++) {
> >                 __u64 start, end;
> >
> >                 start = bpf_get_cpu_cycles();
> >                 end = bpf_get_cpu_cycles();
> >                 total += bpf_cpu_cycles_to_ns(end - start);
> >         }
> >         inc_counter_batch(batch_iters);
> >
> >         return 0;
> > }
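> >
> > For completeness, the user-space trigger side is just each worker
> > thread spinning on a cheap syscall so the raw_tp/sys_enter programs
> > above fire. Roughly like this (a sketch, minus the bench harness
> > plumbing for CPU pinning and counter collection; trigger_producer is
> > an illustrative name, not the exact code):
> >
> > #include <unistd.h>
> > #include <sys/syscall.h>
> >
> > static void *trigger_producer(void *ctx)
> > {
> >         (void)ctx;
> >         /* any cheap, always-available syscall works as a trigger;
> >          * each entry into the kernel fires the sys_enter raw
> >          * tracepoint, which runs the attached BPF program
> >          */
> >         for (;;)
> >                 syscall(__NR_getpgid);
> >         return NULL;
> > }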
> >
> > And here's what I got across multiple numbers of parallel CPUs on
> > our production host.
> >
> > # ./bench_timing.sh
> >
> > ktime  ( 1 cpus):    32.286 ±  0.309M/s  ( 32.286M/s/cpu)
> > ktime  ( 2 cpus):    63.021 ±  0.538M/s  ( 31.511M/s/cpu)
> > ktime  ( 3 cpus):    94.211 ±  0.686M/s  ( 31.404M/s/cpu)
> > ktime  ( 4 cpus):   124.757 ±  0.691M/s  ( 31.189M/s/cpu)
> > ktime  ( 5 cpus):   154.855 ±  0.693M/s  ( 30.971M/s/cpu)
> > ktime  ( 6 cpus):   185.551 ±  2.304M/s  ( 30.925M/s/cpu)
> > ktime  ( 7 cpus):   211.117 ±  4.755M/s  ( 30.160M/s/cpu)
> > ktime  ( 8 cpus):   236.454 ±  0.226M/s  ( 29.557M/s/cpu)
> > ktime  (10 cpus):   295.526 ±  0.126M/s  ( 29.553M/s/cpu)
> > ktime  (12 cpus):   322.282 ±  0.153M/s  ( 26.857M/s/cpu)
> > ktime  (14 cpus):   375.347 ±  0.087M/s  ( 26.811M/s/cpu)
> > ktime  (16 cpus):   399.813 ±  0.181M/s  ( 24.988M/s/cpu)
> > ktime  (24 cpus):   617.675 ±  7.053M/s  ( 25.736M/s/cpu)
> > ktime  (32 cpus):   819.695 ±  0.231M/s  ( 25.615M/s/cpu)
> > ktime  (40 cpus):   996.264 ±  0.290M/s  ( 24.907M/s/cpu)
> > ktime  (48 cpus):  1180.201 ±  0.160M/s  ( 24.588M/s/cpu)
> > ktime  (56 cpus):  1321.084 ±  0.099M/s  ( 23.591M/s/cpu)
> > ktime  (64 cpus):  1482.061 ±  0.121M/s  ( 23.157M/s/cpu)
> > ktime  (72 cpus):  1666.540 ±  0.460M/s  ( 23.146M/s/cpu)
> > ktime  (80 cpus):  1851.419 ±  0.439M/s  ( 23.143M/s/cpu)
> >
> > cycles ( 1 cpus):    45.815 ±  0.018M/s  ( 45.815M/s/cpu)
> > cycles ( 2 cpus):    86.706 ±  0.068M/s  ( 43.353M/s/cpu)
> > cycles ( 3 cpus):   129.899 ±  0.101M/s  ( 43.300M/s/cpu)
> > cycles ( 4 cpus):   168.435 ±  0.073M/s  ( 42.109M/s/cpu)
> > cycles ( 5 cpus):   210.520 ±  0.164M/s  ( 42.104M/s/cpu)
> > cycles ( 6 cpus):   252.596 ±  0.050M/s  ( 42.099M/s/cpu)
> > cycles ( 7 cpus):   294.356 ±  0.159M/s  ( 42.051M/s/cpu)
> > cycles ( 8 cpus):   317.167 ±  0.163M/s  ( 39.646M/s/cpu)
> > cycles (10 cpus):   396.141 ±  0.208M/s  ( 39.614M/s/cpu)
> > cycles (12 cpus):   431.938 ±  0.511M/s  ( 35.995M/s/cpu)
> > cycles (14 cpus):   503.055 ±  0.070M/s  ( 35.932M/s/cpu)
> > cycles (16 cpus):   534.261 ±  0.107M/s  ( 33.391M/s/cpu)
> > cycles (24 cpus):   836.838 ±  0.141M/s  ( 34.868M/s/cpu)
> > cycles (32 cpus):  1099.689 ±  0.314M/s  ( 34.365M/s/cpu)
> > cycles (40 cpus):  1336.573 ±  0.015M/s  ( 33.414M/s/cpu)
> > cycles (48 cpus):  1571.734 ± 11.151M/s  ( 32.744M/s/cpu)
> > cycles (56 cpus):  1819.242 ±  4.627M/s  ( 32.486M/s/cpu)
> > cycles (64 cpus):  2046.285 ±  5.169M/s  ( 31.973M/s/cpu)
> > cycles (72 cpus):  2287.683 ±  0.787M/s  ( 31.773M/s/cpu)
> > cycles (80 cpus):  2505.414 ±  0.626M/s  ( 31.318M/s/cpu)
> >
> > So, from about +42% on a single CPU to +36% at 80 CPUs. Not that bad.
> > Scalability-wise, we still see some drop-off in performance, but
> > believe me, with bpf_prog_test_run() it was so much worse :) I also
> > verified that we now spend cycles almost exclusively inside the BPF
> > program (so presumably in those benchmarked kfuncs).
>
> Am I right that the numbers show how many iterations were done during
> the same amount of time?

yeah, it's millions of operations per second

> It would also be great to understand whether we get more precise
> measurements - just in case you have your tests ready...
>

not sure what precision you mean here?... I didn't plan to publish
tests, but I can push what I have into a local branch if you'd like to
play with this.
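If you mean how exact the cycles -> ns conversion can be: I'd expect
bpf_cpu_cycles_to_ns() to boil down to the usual clocksource-style
mult/shift scaling, something like the sketch below (assuming the patch
follows the common pattern; cycles_to_ns() and its mult/shift
parameters here are illustrative, not the actual patch code):

#include <linux/math64.h>

/* illustrative only: standard clocksource scaling,
 * ns = (cycles * mult) >> shift; mult/shift would come from the
 * underlying clocksource (e.g., TSC on x86), and any sub-nanosecond
 * remainder simply truncates
 */
static u64 cycles_to_ns(u64 cycles, u32 mult, u32 shift)
{
        return mul_u64_u32_shr(cycles, mult, shift);
}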