On Thu, 18 Jul 2024 at 23:44, Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
>
>
> On 7/18/24 1:52 PM, Yonghong Song wrote:
> > This patch intends to show some benchmark results comparing a bpf
> > program with vs. without private stack. The patch is not intended
> > to land since it hacks the existing kernel interface in order to
> > do a proper comparison. The bpf program is similar to
> > 7df4e597ea2c ("selftests/bpf: add batched, mostly in-kernel BPF triggering benchmarks")
> > where a raw_tp program is triggered with bpf_prog_test_run_opts() and
> > the raw_tp program has a loop of calls to the helper bpf_get_numa_node_id(),
> > so that a fentry prog attached to it runs on every call. The fentry prog
> > calls three do-nothing functions to maximally expose the cost of the
> > private stack.
> >
> > The following is the jited code for the bpf prog in progs/private_stack.c
> > without private stack. The number of batch iterations is 4096.
> >
> > subprog:
> >    0:   f3 0f 1e fa             endbr64
> >    4:   0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
> >    9:   66 90                   xchg   ax,ax
> >    b:   55                      push   rbp
> >    c:   48 89 e5                mov    rbp,rsp
> >    f:   f3 0f 1e fa             endbr64
> >   13:   31 c0                   xor    eax,eax
> >   15:   c9                      leave
> >   16:   c3                      ret
> >
> > main prog:
> >    0:   f3 0f 1e fa             endbr64
> >    4:   0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
> >    9:   66 90                   xchg   ax,ax
> >    b:   55                      push   rbp
> >    c:   48 89 e5                mov    rbp,rsp
> >    f:   f3 0f 1e fa             endbr64
> >   13:   48 bf 00 e0 57 00 00    movabs rdi,0xffffc9000057e000
> >   1a:   c9 ff ff
> >   1d:   48 8b 77 00             mov    rsi,QWORD PTR [rdi+0x0]
> >   21:   48 83 c6 01             add    rsi,0x1
> >   25:   48 89 77 00             mov    QWORD PTR [rdi+0x0],rsi
> >   29:   e8 6e 00 00 00          call   0x9c
> >   2e:   e8 69 00 00 00          call   0x9c
> >   33:   e8 64 00 00 00          call   0x9c
> >   38:   31 c0                   xor    eax,eax
> >   3a:   c9                      leave
> >   3b:   c3                      ret
> >
> > The following are the jited progs with private stack:
> >
> > subprog:
> >    0:   f3 0f 1e fa             endbr64
> >    4:   0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
> >    9:   66 90                   xchg   ax,ax
> >    b:   55                      push   rbp
> >    c:   48 89 e5                mov    rbp,rsp
> >    f:   f3 0f 1e fa             endbr64
> >   13:   49 b9 70 a6 c1 08 7e    movabs r9,0x607e08c1a670
> >   1a:   60 00 00
> >   1d:   65 4c 03 0c 25 00 1a    add    r9,QWORD PTR gs:0x21a00
> >   24:   02 00
> >   26:   31 c0                   xor    eax,eax
> >   28:   c9                      leave
> >   29:   c3                      ret
> >
> > main prog:
> >    0:   f3 0f 1e fa             endbr64
> >    4:   0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
> >    9:   66 90                   xchg   ax,ax
> >    b:   55                      push   rbp
> >    c:   48 89 e5                mov    rbp,rsp
> >    f:   f3 0f 1e fa             endbr64
> >   13:   49 b9 88 a6 c1 08 7e    movabs r9,0x607e08c1a688
> >   1a:   60 00 00
> >   1d:   65 4c 03 0c 25 00 1a    add    r9,QWORD PTR gs:0x21a00
> >   24:   02 00
> >   26:   48 bf 00 d0 5b 00 00    movabs rdi,0xffffc900005bd000
> >   2d:   c9 ff ff
> >   30:   48 8b 77 00             mov    rsi,QWORD PTR [rdi+0x0]
> >   34:   48 83 c6 01             add    rsi,0x1
> >   38:   48 89 77 00             mov    QWORD PTR [rdi+0x0],rsi
> >   3c:   41 51                   push   r9
> >   3e:   e8 46 23 51 e1          call   0xffffffffe1512389
> >   43:   41 59                   pop    r9
> >   45:   41 51                   push   r9
> >   47:   e8 3d 23 51 e1          call   0xffffffffe1512389
> >   4c:   41 59                   pop    r9
> >   4e:   41 51                   push   r9
> >   50:   e8 34 23 51 e1          call   0xffffffffe1512389
> >   55:   41 59                   pop    r9
> >   57:   31 c0                   xor    eax,eax
> >   59:   c9                      leave
> >   5a:   c3                      ret
> >
> > From the above, it is clear that for both the subprog and the main prog
> > we have some r9-related overhead, including retrieving the private stack
> > pointer in the jit prologue code:
> >   movabs r9,0x607e08c1a688
> >   add    r9,QWORD PTR gs:0x21a00
> > and a 'push r9'/'pop r9' pair around each subprog call.
> >
> > I did some benchmarking on an intel box (Intel(R) Xeon(R) D-2191A CPU @ 1.60GHz)
> > which has 20 cores and 80 cpus. The number of hits is in units of
> > loop iterations.
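
(For readers trying to picture the setup: based on the description above,
the BPF side of the benchmark presumably looks roughly like the sketch
below. This is only an illustration, not the actual progs/private_stack.c;
func1 and trigger_driver match the prog names visible in the perf profiles
further down, while subprog1..3 and the variable names are my guesses.)

/* Sketch of the benchmark BPF programs; details are assumed. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

long hits = 0;                          /* hit counter read by the bench harness */
const volatile int nr_batch_iters = 1;  /* set from --nr-batch-iters */

/* Three do-nothing subprogs; with private stack enabled, each call to
 * them gets wrapped with push r9/pop r9 as seen in the jited code. */
static __noinline int subprog1(void) { return 0; }
static __noinline int subprog2(void) { return 0; }
static __noinline int subprog3(void) { return 0; }

/* fentry prog attached to the bpf_get_numa_node_id() kernel function;
 * presumably the "main prog" whose jited code is shown above. */
SEC("fentry/bpf_get_numa_node_id")
int BPF_PROG(func1)
{
	hits++;
	subprog1();
	subprog2();
	subprog3();
	return 0;
}

/* raw_tp driver run via bpf_prog_test_run_opts(); the loop of helper
 * calls triggers the fentry prog nr_batch_iters times per run. */
SEC("raw_tp")
int trigger_driver(void *ctx)
{
	int i;

	for (i = 0; i < nr_batch_iters; i++)
		bpf_get_numa_node_id();
	return 0;
}

char _license[] SEC("license") = "GPL";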
> >
> > The following are two benchmark results; a few other runs show
> > similar results in terms of variation.
> >
> > $ ./benchs/run_bench_private_stack.sh
> > no-private-stack-1:         2.152 ± 0.004M/s   (drops 0.000 ± 0.000M/s)
> > private-stack-1:            2.226 ± 0.003M/s   (drops 0.000 ± 0.000M/s)
> > no-private-stack-8:        89.086 ± 0.674M/s   (drops 0.000 ± 0.000M/s)
> > private-stack-8:           90.023 ± 0.117M/s   (drops 0.000 ± 0.000M/s)
> > no-private-stack-64:     1545.383 ± 3.574M/s   (drops 0.000 ± 0.000M/s)
> > private-stack-64:        1534.630 ± 2.063M/s   (drops 0.000 ± 0.000M/s)
> > no-private-stack-512:   14591.591 ± 15.202M/s  (drops 0.000 ± 0.000M/s)
> > private-stack-512:      14323.796 ± 13.165M/s  (drops 0.000 ± 0.000M/s)
> > no-private-stack-2048:  58680.977 ± 46.116M/s  (drops 0.000 ± 0.000M/s)
> > private-stack-2048:     58614.699 ± 22.031M/s  (drops 0.000 ± 0.000M/s)
> > no-private-stack-4096: 119974.497 ± 90.985M/s  (drops 0.000 ± 0.000M/s)
> > private-stack-4096:    114841.949 ± 59.514M/s  (drops 0.000 ± 0.000M/s)
> > $ ./benchs/run_bench_private_stack.sh
> > no-private-stack-1:         2.246 ± 0.002M/s   (drops 0.000 ± 0.000M/s)
> > private-stack-1:            2.232 ± 0.005M/s   (drops 0.000 ± 0.000M/s)
> > no-private-stack-8:        91.446 ± 0.055M/s   (drops 0.000 ± 0.000M/s)
> > private-stack-8:           90.120 ± 0.069M/s   (drops 0.000 ± 0.000M/s)
> > no-private-stack-64:     1578.374 ± 1.508M/s   (drops 0.000 ± 0.000M/s)
> > private-stack-64:        1514.909 ± 3.898M/s   (drops 0.000 ± 0.000M/s)
> > no-private-stack-512:   14767.811 ± 22.399M/s  (drops 0.000 ± 0.000M/s)
> > private-stack-512:      14232.382 ± 227.217M/s (drops 0.000 ± 0.000M/s)
> > no-private-stack-2048:  58342.372 ± 81.519M/s  (drops 0.000 ± 0.000M/s)
> > private-stack-2048:     54503.335 ± 160.199M/s (drops 0.000 ± 0.000M/s)
> > no-private-stack-4096: 117262.975 ± 179.802M/s (drops 0.000 ± 0.000M/s)
> > private-stack-4096:    114643.523 ± 146.956M/s (drops 0.000 ± 0.000M/s)
> >
> > It is clear that private-stack is worse than no-private-stack by up to
> > roughly 5 percent. This can be roughly estimated from the above jit code
> > for no-private-stack vs. private-stack.
> >
> > Although the benchmark shows up to 5% potential slowdown with private
> > stack, in reality the kernel enables the private stack only when the
> > stack size exceeds 64, which means the bpf prog will do some useful
> > work. If the bpf prog uses any helper/kfunc, the push/pop r9 overhead
> > should be minimal compared to the cost of the helper/kfunc itself.
> > If the prog does not use a lot of helpers/kfuncs, there is no push/pop
> > r9 and the performance should be reasonable too.
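
(Side note on what is being measured: each bench producer thread
presumably sits in a tight loop around bpf_prog_test_run_opts() on the
raw_tp driver prog, so one "hit" corresponds to one bpf_get_numa_node_id()
call and hence one fentry invocation; that is also why
bpf_prog_test_run_opts shows up in the user-space part of the perf
profiles below. A rough sketch of that loop, with illustrative names:)

/* Sketch of the user-space driver loop; not the actual bench code. */
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

static void *producer(void *arg)
{
	int prog_fd = *(int *)arg;   /* fd of the raw_tp driver prog */
	LIBBPF_OPTS(bpf_test_run_opts, opts);

	/* Each call runs the driver prog once, i.e. nr_batch_iters
	 * helper calls and fentry invocations in kernel context. */
	for (;;)
		bpf_prog_test_run_opts(prog_fd, &opts);
	return NULL;
}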
> >
> > With 4096 loop iterations per program run, I got:
> >
> > $ perf record -- ./bench -w3 -d10 -a --nr-batch-iters=4096 no-private-stack
> >   18.47%  bench  [k]
> >   17.29%  bench  bpf_trampoline_6442522961                 [k] bpf_trampoline_6442522961
> >   13.33%  bench  bpf_prog_bcf7977d3b93787c_func1           [k] bpf_prog_bcf7977d3b93787c_func1
> >   11.86%  bench  [kernel.vmlinux]                          [k] migrate_enable
> >   11.60%  bench  [kernel.vmlinux]                          [k] __bpf_prog_enter_recur
> >   11.42%  bench  [kernel.vmlinux]                          [k] __bpf_prog_exit_recur
> >    7.87%  bench  [kernel.vmlinux]                          [k] migrate_disable
> >    3.71%  bench  [kernel.vmlinux]                          [k] bpf_get_numa_node_id
> >    3.67%  bench  bpf_prog_d9703036495d54b0_trigger_driver  [k] bpf_prog_d9703036495d54b0_trigger_driver
> >    0.04%  bench  bench                                     [.] btf_validate_type
> >
> > $ perf record -- ./bench -w3 -d10 -a --nr-batch-iters=4096 private-stack
> >   18.94%  bench  [k]
> >   16.88%  bench  bpf_prog_bcf7977d3b93787c_func1           [k] bpf_prog_bcf7977d3b93787c_func1
> >   15.77%  bench  bpf_trampoline_6442522961                 [k] bpf_trampoline_6442522961
> >   11.70%  bench  [kernel.vmlinux]                          [k] __bpf_prog_enter_recur
> >   11.48%  bench  [kernel.vmlinux]                          [k] migrate_enable
> >   11.30%  bench  [kernel.vmlinux]                          [k] __bpf_prog_exit_recur
> >    5.85%  bench  [kernel.vmlinux]                          [k] migrate_disable
> >    3.69%  bench  bpf_prog_d9703036495d54b0_trigger_driver  [k] bpf_prog_d9703036495d54b0_trigger_driver
> >    3.56%  bench  [kernel.vmlinux]                          [k] bpf_get_numa_node_id
> >    0.06%  bench  bench                                     [.] bpf_prog_test_run_opts
> >
> > NOTE: I tried 6.4 perf and 6.10 perf, both of which have issues. I will
> > investigate this further.
>
> I tried with perf built against the latest bpf-next and with
> no-private-stack; the issue still exists. Will debug more.
>

Just as an aside: if this doesn't work, I think you can get a better
signal-to-noise ratio if you try enabling the private stack for XDP
programs, and set up two machines with a client sending traffic to the
other, running xdp-bench [0] on the server. I think you should observe
measurable differences in throughput for nanosecond-scale changes,
especially in programs like drop, which do very little.

  [0]: https://github.com/xdp-project/xdp-tools
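
For reference, the XDP program for such a drop test is essentially just
the sketch below (assuming the private stack is force-enabled for the
experiment, as in the hack above); with a program this small, any extra
prologue or push/pop cost becomes a visible fraction of the per-packet
work, which is what makes the throughput numbers sensitive:

/* Minimal XDP drop program; a stand-in for what xdp-bench's drop mode
 * effectively does per packet. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_drop(struct xdp_md *ctx)
{
	return XDP_DROP;
}

char _license[] SEC("license") = "GPL";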