Re: [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack

Yonghong Song <yonghong.song@xxxxxxxxx> · Mon, 22 Jul 2024 09:33:40 -0700

On 7/19/24 6:08 PM, Alexei Starovoitov wrote:
> On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
>>
>> The following are the jited progs with private stack:
>>
>> subprog:
>> 0:  f3 0f 1e fa             endbr64
>> 4:  0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
>> 9:  66 90                   xchg   ax,ax
>> b:  55                      push   rbp
>> c:  48 89 e5                mov    rbp,rsp
>> f:  f3 0f 1e fa             endbr64
>> 13: 49 b9 70 a6 c1 08 7e    movabs r9,0x607e08c1a670
>> 1a: 60 00 00
>> 1d: 65 4c 03 0c 25 00 1a    add    r9,QWORD PTR gs:0x21a00
>> 24: 02 00
>> 26: 31 c0                   xor    eax,eax
>> 28: c9                      leave
>> 29: c3                      ret
>
> Thanks for doing the benchmarking.
> It's clear now that worst case overhead is ~5%.
> Could you do one more benchmark such that the 'main prog'
> below stays as-is with setup of r9 and push/pop r9,
> but in the subprog above there is no 'movabs r9 + add r9' ?
> To simulate the case when a big function with a large stack
> triggers private-stack use, but it calls a subprog without
> a private stack.
> I think we should see a different overhead.
> Obviously the subprog won't have these two extra insns that set up r9,
> which would lead to something like ~4% slowdown vs 5%,
> but I feel the overhead of pure push/pop r9 around calls
> will be lower as well, because r9 is not written to inside the subprog.
> The CPU HW should be able to execute such push/pop faster.
> I'm curious what it is.

Sure. Let me do an experiment with this.

>> main prog:
>> 0:  f3 0f 1e fa             endbr64
>> 4:  0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
>> 9:  66 90                   xchg   ax,ax
>> b:  55                      push   rbp
>> c:  48 89 e5                mov    rbp,rsp
>> f:  f3 0f 1e fa             endbr64
>> 13: 49 b9 88 a6 c1 08 7e    movabs r9,0x607e08c1a688
>> 1a: 60 00 00
>> 1d: 65 4c 03 0c 25 00 1a    add    r9,QWORD PTR gs:0x21a00
>> 24: 02 00
>> 26: 48 bf 00 d0 5b 00 00    movabs rdi,0xffffc900005bd000
>> 2d: c9 ff ff
>> 30: 48 8b 77 00             mov    rsi,QWORD PTR [rdi+0x0]
>> 34: 48 83 c6 01             add    rsi,0x1
>> 38: 48 89 77 00             mov    QWORD PTR [rdi+0x0],rsi
>> 3c: 41 51                   push   r9
>> 3e: e8 46 23 51 e1          call   0xffffffffe1512389
>> 43: 41 59                   pop    r9
>> 45: 41 51                   push   r9
>> 47: e8 3d 23 51 e1          call   0xffffffffe1512389
>> 4c: 41 59                   pop    r9
>> 4e: 41 51                   push   r9
>> 50: e8 34 23 51 e1          call   0xffffffffe1512389
>> 55: 41 59                   pop    r9
>> 57: 31 c0                   xor    eax,eax
>> 59: c9                      leave
>> 5a: c3                      ret

> Also pls share 'perf annotate' of JIT-ed asm.
> I wonder where the hotspots are in the code.

Okay, will do.