Re: [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Fri, 19 Jul 2024 18:08:59 -0700

On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
>
>
> The following are the jited progs with private stack:
>
> subprog:
> 0:  f3 0f 1e fa             endbr64
> 4:  0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
> 9:  66 90                   xchg   ax,ax
> b:  55                      push   rbp
> c:  48 89 e5                mov    rbp,rsp
> f:  f3 0f 1e fa             endbr64
> 13: 49 b9 70 a6 c1 08 7e    movabs r9,0x607e08c1a670
> 1a: 60 00 00
> 1d: 65 4c 03 0c 25 00 1a    add    r9,QWORD PTR gs:0x21a00
> 24: 02 00
> 26: 31 c0                   xor    eax,eax
> 28: c9                      leave
> 29: c3                      ret

Thanks for doing the benchmarking.
It's clear now that the worst-case overhead is ~5%.
Could you do one more benchmark where the 'main prog'
below stays as-is, with the setup of r9 and the push/pop r9,
but the subprog above has no 'movabs r9 + add r9'?
That would simulate the case where a big function with a large stack
triggers private-stack use, but calls a subprog without
a private stack.
I think we should see a different overhead.
Obviously the subprog won't have the two extra insns that set up r9,
which would lead to something like a ~4% slowdown instead of ~5%,
but I suspect the overhead of the pure push/pop r9 around calls
will be lower as well, because r9 is not written inside the subprog.
The CPU should be able to execute such push/pop pairs faster.
I'm curious what the number is.
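
For reference, I'd expect the subprog in that variant to look roughly
like the listing below. This is hand-edited from your dump above by
dropping the two r9 insns and renumbering the offsets, so treat it as
a sketch and not actual JIT output:

```
subprog (no private stack, hand-edited sketch):
0:  f3 0f 1e fa             endbr64
4:  0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
9:  66 90                   xchg   ax,ax
b:  55                      push   rbp
c:  48 89 e5                mov    rbp,rsp
f:  f3 0f 1e fa             endbr64
13: 31 c0                   xor    eax,eax
15: c9                      leave
16: c3                      ret
```

The main prog would keep its 'movabs r9 + add r9' prologue and the
push/pop r9 around each call.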

> main prog:
> 0:  f3 0f 1e fa             endbr64
> 4:  0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
> 9:  66 90                   xchg   ax,ax
> b:  55                      push   rbp
> c:  48 89 e5                mov    rbp,rsp
> f:  f3 0f 1e fa             endbr64
> 13: 49 b9 88 a6 c1 08 7e    movabs r9,0x607e08c1a688
> 1a: 60 00 00
> 1d: 65 4c 03 0c 25 00 1a    add    r9,QWORD PTR gs:0x21a00
> 24: 02 00
> 26: 48 bf 00 d0 5b 00 00    movabs rdi,0xffffc900005bd000
> 2d: c9 ff ff
> 30: 48 8b 77 00             mov    rsi,QWORD PTR [rdi+0x0]
> 34: 48 83 c6 01             add    rsi,0x1
> 38: 48 89 77 00             mov    QWORD PTR [rdi+0x0],rsi
> 3c: 41 51                   push   r9
> 3e: e8 46 23 51 e1          call   0xffffffffe1512389
> 43: 41 59                   pop    r9
> 45: 41 51                   push   r9
> 47: e8 3d 23 51 e1          call   0xffffffffe1512389
> 4c: 41 59                   pop    r9
> 4e: 41 51                   push   r9
> 50: e8 34 23 51 e1          call   0xffffffffe1512389
> 55: 41 59                   pop    r9
> 57: 31 c0                   xor    eax,eax
> 59: c9                      leave
> 5a: c3                      ret
>

Also, please share the 'perf annotate' output for the JIT-ed asm.
I wonder where the hotspots are in the code.