On 7/19/24 6:08 PM, Alexei Starovoitov wrote:
On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
The following are the jited progs with private stack:
subprog:
0: f3 0f 1e fa endbr64
4: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
9: 66 90 xchg ax,ax
b: 55 push rbp
c: 48 89 e5 mov rbp,rsp
f: f3 0f 1e fa endbr64
13: 49 b9 70 a6 c1 08 7e movabs r9,0x607e08c1a670
1a: 60 00 00
1d: 65 4c 03 0c 25 00 1a add r9,QWORD PTR gs:0x21a00
24: 02 00
26: 31 c0 xor eax,eax
28: c9 leave
29: c3 ret
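For readers decoding the listing by hand, here is a minimal Python sketch (not part of the patch) that recovers the two constants used in the r9 setup from the raw bytes at offsets 0x13 and 0x1d. Reading gs:0x21a00 as a per-CPU offset that is added to a per-CPU private-stack pointer is an assumption; the listing itself only shows the displacement.

```python
import struct

# Bytes copied from the subprog listing (offsets 0x13 and 0x1d).
movabs_r9 = bytes.fromhex("49 b9 70 a6 c1 08 7e 60 00 00")
add_r9_gs = bytes.fromhex("65 4c 03 0c 25 00 1a 02 00")

# movabs r9, imm64: REX prefix (0x49), opcode 0xb9, then a
# little-endian 64-bit immediate.
imm64 = struct.unpack("<Q", movabs_r9[2:10])[0]

# add r9, gs:disp32: gs prefix (0x65), REX (0x4c), opcode 0x03,
# ModRM (0x0c), SIB (0x25), then a little-endian 32-bit displacement.
disp32 = struct.unpack("<I", add_r9_gs[5:9])[0]

print(hex(imm64))   # 0x607e08c1a670
print(hex(disp32))  # 0x21a00
```

The split second line of each instruction in the listing ("60 00 00" and "02 00") is just the tail of these immediates.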
Thanks for doing the benchmarking.
It's clear now that the worst-case overhead is ~5%.
Could you do one more benchmark such that the 'main prog'
below stays as-is with setup of r9 and push/pop r9,
but in the subprog above there is no 'movabs r9 + add r9' ?
That would simulate the case where a big function with a large stack
triggers private-stack use, but calls a subprog without
a private stack.
I think we should see a different overhead.
Obviously the subprog won't have the two extra insns that set up r9,
which would lead to something like a ~4% slowdown vs. 5%,
but I feel the overhead of the pure push/pop r9 around calls
will be lower as well, because r9 is not written inside the subprog.
The CPU HW should be able to execute such push/pop faster.
I'm curious what it is.
Sure. Let me do an experiment with this.
main prog:
0: f3 0f 1e fa endbr64
4: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
9: 66 90 xchg ax,ax
b: 55 push rbp
c: 48 89 e5 mov rbp,rsp
f: f3 0f 1e fa endbr64
13: 49 b9 88 a6 c1 08 7e movabs r9,0x607e08c1a688
1a: 60 00 00
1d: 65 4c 03 0c 25 00 1a add r9,QWORD PTR gs:0x21a00
24: 02 00
26: 48 bf 00 d0 5b 00 00 movabs rdi,0xffffc900005bd000
2d: c9 ff ff
30: 48 8b 77 00 mov rsi,QWORD PTR [rdi+0x0]
34: 48 83 c6 01 add rsi,0x1
38: 48 89 77 00 mov QWORD PTR [rdi+0x0],rsi
3c: 41 51 push r9
3e: e8 46 23 51 e1 call 0xffffffffe1512389
43: 41 59 pop r9
45: 41 51 push r9
47: e8 3d 23 51 e1 call 0xffffffffe1512389
4c: 41 59 pop r9
4e: 41 51 push r9
50: e8 34 23 51 e1 call 0xffffffffe1512389
55: 41 59 pop r9
57: 31 c0 xor eax,eax
59: c9 leave
5a: c3 ret
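As a cross-check of the listing above, here is a small Python sketch (an illustration, not from the patch) that resolves the three call instructions. A near call (opcode 0xe8) jumps to the address of the next instruction plus a signed 32-bit displacement. The helper name call_target is hypothetical, and offsets are taken relative to the start of the prog, as the disassembler prints them.

```python
import struct

def call_target(offset, insn):
    """Resolve an x86-64 'call rel32' located at the given offset."""
    assert insn[0] == 0xE8 and len(insn) == 5
    rel32 = struct.unpack("<i", insn[1:5])[0]  # signed displacement
    # Target is relative to the end of the call instruction;
    # mask to 64 bits to match the disassembler's output.
    return (offset + len(insn) + rel32) & 0xFFFFFFFFFFFFFFFF

# (offset, bytes) pairs copied from the main prog listing.
calls = [
    (0x3E, bytes.fromhex("e8 46 23 51 e1")),
    (0x47, bytes.fromhex("e8 3d 23 51 e1")),
    (0x50, bytes.fromhex("e8 34 23 51 e1")),
]

targets = [call_target(off, insn) for off, insn in calls]
for t in targets:
    print(hex(t))  # 0xffffffffe1512389 for each call
```

All three calls resolve to the same target, so each push r9 / pop r9 pair wraps a call to the same callee (presumably the subprog shown earlier).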
Also please share 'perf annotate' of the JIT-ed asm.
I wonder where the hotspots are in the code.
Okay, will do.