On Thu, Jul 18, 2024 at 1:52 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
>
>
> The following are the jited progs with private stack:
>
> subprog:
>    0:  f3 0f 1e fa             endbr64
>    4:  0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
>    9:  66 90                   xchg   ax,ax
>    b:  55                      push   rbp
>    c:  48 89 e5                mov    rbp,rsp
>    f:  f3 0f 1e fa             endbr64
>   13:  49 b9 70 a6 c1 08 7e    movabs r9,0x607e08c1a670
>   1a:  60 00 00
>   1d:  65 4c 03 0c 25 00 1a    add    r9,QWORD PTR gs:0x21a00
>   24:  02 00
>   26:  31 c0                   xor    eax,eax
>   28:  c9                      leave
>   29:  c3                      ret

Thanks for doing the benchmarking.
It's clear now that worst case overhead is ~5%.
Could you do one more benchmark such that the 'main prog' below
stays as-is with setup of r9 and push/pop r9, but in the subprog
above there is no 'movabs r9 + add r9'?

This would simulate the case when a big function with a large stack
triggers private-stack use, but it calls a subprog without
a private stack.
I think we should see a different overhead.
Obviously the subprog won't have these two extra insns that set up r9,
which would lead to something like ~4% slowdown vs 5%,
but I feel the overhead of pure push/pop r9 around calls
will be lower as well, because r9 is not written inside the subprog.
The CPU HW should be able to execute such push/pop faster.
I'm curious what it is.

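For reference, a sketch of what the subprog in the requested variant might look like. This is hand-written by removing the 'movabs r9' and 'add r9' instructions from the quoted subprog listing, on the assumption that the JIT emits nothing in their place. It is not actual JIT output, so offsets and opcode bytes are omitted.

```
; hypothetical subprog without private-stack setup (not real JIT output)
subprog:
    endbr64
    nop    DWORD PTR [rax+rax*1+0x0]
    xchg   ax,ax
    push   rbp
    mov    rbp,rsp
    endbr64
    ; no 'movabs r9,<imm64>' and no 'add r9,QWORD PTR gs:<off>' here,
    ; so r9 is never written inside the subprog
    xor    eax,eax
    leave
    ret
```

The main prog would stay unchanged, so every call is still wrapped in 'push r9' / 'pop r9'.
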
> main prog:
>    0:  f3 0f 1e fa             endbr64
>    4:  0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
>    9:  66 90                   xchg   ax,ax
>    b:  55                      push   rbp
>    c:  48 89 e5                mov    rbp,rsp
>    f:  f3 0f 1e fa             endbr64
>   13:  49 b9 88 a6 c1 08 7e    movabs r9,0x607e08c1a688
>   1a:  60 00 00
>   1d:  65 4c 03 0c 25 00 1a    add    r9,QWORD PTR gs:0x21a00
>   24:  02 00
>   26:  48 bf 00 d0 5b 00 00    movabs rdi,0xffffc900005bd000
>   2d:  c9 ff ff
>   30:  48 8b 77 00             mov    rsi,QWORD PTR [rdi+0x0]
>   34:  48 83 c6 01             add    rsi,0x1
>   38:  48 89 77 00             mov    QWORD PTR [rdi+0x0],rsi
>   3c:  41 51                   push   r9
>   3e:  e8 46 23 51 e1          call   0xffffffffe1512389
>   43:  41 59                   pop    r9
>   45:  41 51                   push   r9
>   47:  e8 3d 23 51 e1          call   0xffffffffe1512389
>   4c:  41 59                   pop    r9
>   4e:  41 51                   push   r9
>   50:  e8 34 23 51 e1          call   0xffffffffe1512389
>   55:  41 59                   pop    r9
>   57:  31 c0                   xor    eax,eax
>   59:  c9                      leave
>   5a:  c3                      ret
>

Also pls share 'perf annotate' of JIT-ed asm.
I wonder where the hotspots are in the code.