On Fri, Jul 10, 2020 at 4:56 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Thu, Jul 02, 2020 at 03:49:29PM +0200, Maciej Fijalkowski wrote:
> > This commit serves two things:
> > 1) it optimizes BPF prologue/epilogue generation
> > 2) it makes it possible to have tailcalls within BPF subprograms
> >
> > Both points are related to each other, since without 1), 2) could
> > not be achieved.
> >
> > In [1], Alexei says:
> > "The prologue will look like:
> > nop5
> > xor eax,eax  // two new bytes if bpf_tail_call() is used in this
> >              // function
> > push rbp
> > mov rbp, rsp
> > sub rsp, rounded_stack_depth
> > push rax // zero init tail_call counter
> > variable number of push rbx,r13,r14,r15
> >
> > Then bpf_tail_call will pop variable number rbx,..
> > and final 'pop rax'
> > Then 'add rsp, size_of_current_stack_frame'
> > jmp to next function and skip over 'nop5; xor eax,eax; push rbp; mov
> > rbp, rsp'
> >
> > This way the new function will set its own stack size and will init
> > the tail call counter with whatever value the parent had.
> >
> > If the next function doesn't use bpf_tail_call it won't have
> > 'xor eax,eax'. Instead it would need to have 'nop2' in there."
> >
> > Implement that suggestion.
> >
> > Since the stack layout is changed, tail call counter handling can no
> > longer rely on popping it into rbx, as was done for the constant
> > prologue case, and then restoring rbx from the value pushed onto the
> > stack. Therefore, let's use one of the volatile/caller-saved
> > registers (%rcx) and pop the tail call counter into it in the
> > epilogue.
> >
> > Drop the BUILD_BUG_ON in emit_prologue and in
> > emit_bpf_tail_call_indirect where the instruction layout is not
> > constant anymore.
> >
> > Introduce a new poke target, 'ip_aux', to the poke descriptor that
> > is dedicated
>
> imo the ip_aux approach has too much x86-specific code in
> kernel/bpf/arraymap.c. Ex. NOP_ATOMIC5 is x86-only and will break the
> build on all other archs.
>
> But I'm not sure ip_aux is really necessary.
> It's nice to optimize the case when the tail_call target is NULL, but
> redundant unwind + nop5 + push_regs_again makes for a much simpler
> design without worrying about state transitions.
>
> So I don't think optimizing the case of target==NULL is really worth
> the complexity.
>
> > for skipping the register pops and stack unwind that are generated
> > right before the actual jump to the target program.
> > For the case when the target program is not present, the BPF program
> > will skip the pop instructions and the nop5 dedicated for
> > jmpq $target. An example of such a state when only R6 of the
> > callee-saved registers is used by the program:
> >
> > ffffffffc0513aa1: e9 0e 00 00 00         jmpq  0xffffffffc0513ab4
> > ffffffffc0513aa6: 5b                     pop   %rbx
> > ffffffffc0513aa7: 58                     pop   %rax
> > ffffffffc0513aa8: 48 81 c4 00 00 00 00   add   $0x0,%rsp
> > ffffffffc0513aaf: 0f 1f 44 00 00         nopl  0x0(%rax,%rax,1)
> > ffffffffc0513ab4: 48 89 df               mov   %rbx,%rdi
>
> so this last rbx->rdi insn is not part of the bpf_tail_call insn, right?
> That is just the 'R1 = R6;' bpf insn JITed.
>
> >
> > When the target program is inserted, the jump that was there to skip
> > the pops/nop5 will become the nop5, so the CPU will go over the pops
> > and do the actual tailcall.
> >
> > One might ask why there simply cannot be pushes after the nop5?
>
> exactly... and...
>
> > In the following example snippet:
> >
> > ffffffffc037030c: 48 89 fb               mov   %rdi,%rbx
> > (...)
> > ffffffffc0370332: 5b                     pop   %rbx
> > ffffffffc0370333: 58                     pop   %rax
> > ffffffffc0370334: 48 81 c4 00 00 00 00   add   $0x0,%rsp
> > ffffffffc037033b: 0f 1f 44 00 00         nopl  0x0(%rax,%rax,1)
> > ffffffffc0370340: 48 81 ec 00 00 00 00   sub   $0x0,%rsp
> > ffffffffc0370347: 50                     push  %rax
> > ffffffffc0370348: 53                     push  %rbx
> > ffffffffc0370349: 48 89 df               mov   %rbx,%rdi
> > ffffffffc037034c: e8 f7 21 00 00         callq 0xffffffffc0372548
> >
> > There is a bpf2bpf call right after the tailcall and the jump target
> > is not present. ctx is in %rbx and the BPF subprogram that we will
> > call into at ffffffffc037034c relies on it, e.g. it will pick up ctx
> > from there. Such a code layout is therefore broken, as we would
> > overwrite the content of %rbx with the value that was pushed in the
> > prologue.
>
> I don't understand the above explanation.
> Are you saying 'callq 0xffffffffc0372548' above is a call to a bpf
> subprogram?
> The 'mov %rbx,%rdi' was 'R1 = R6' before JIT.
> The code is storing ctx into R1 to pass into the bpf subprogram.
> It's not part of the proposed emit_bpf_tail_call_direct() handling.
> It's part of the BPF program.
> I don't see what breaks.
>
> The new emit_bpf_tail_call_indirect() looks correct to me.
>
> But emit_bpf_tail_call_direct() doesn't need
> + emit_jump(&prog, (u8 *)poke->ip + X86_PATCH_SIZE, poke->ip_aux);
> and the messy poke->ip_aux.
>
> It can do:
> pop_callee_regs()
> memcpy(prog, ideal_nops[NOP_ATOMIC5], X86_PATCH_SIZE);
> push_callee_regs()
>
> When the target is NULL, each pop/push pair is overall a nop.
> They don't affect correctness.
> When prog_array_map_poke_run() is called it will replace the nop5
> with a jump. So still all good.
>
> Yes, there will be a tiny overhead when the tail_call target is NULL,
> but x86 will execute a pop/push pair in _one_ cpu cycle.
> As far as I recall, x86 hardware has special logic to recognize
> such push/pop sequences, so they are really fast.
>
> What am I missing?

Of course you are right. pop+nop+push is incorrect.

How about the following instead:

- during JIT:
  emit_jump(to_skip_below)      <- poke->tailcall_bypass
  pop_callee_regs
  emit_jump(to_tailcall_target) <- poke->tailcall_target

- transition from one target to another:
  text_poke(poke->tailcall_target, MOD_JMP, old_jmp, new_jmp)
  if (new_jmp != NULL)
          text_poke(poke->tailcall_bypass, MOD jmp into nop);
  else
          text_poke(poke->tailcall_bypass, MOD nop into jmp);

In other words, let's keep the jmp always valid, so the race you've
described in the cover letter won't ever happen.

kernel/bpf/arraymap.c will stay arch-independent, too.

Thoughts?
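To make that concrete, below is a rough, untested C sketch of the
transition step, built on the existing bpf_arch_text_poke() helper.
The tailcall_poke_transition() name and the tailcall_target /
tailcall_bypass / bypass_addr fields are only illustrative here
(bypass_addr standing for the address the bypass jmp lands on, i.e.
right behind the target jmp); this is not the actual patch:

#include <linux/bpf.h>

/* Sketch only: assumes the poke descriptor carries three addresses
 * recorded by the x86 JIT:
 *   tailcall_target - the jmp to the tail call target program
 *   tailcall_bypass - the jmp that skips the pops + target jmp
 *   bypass_addr     - where tailcall_bypass points when it is armed
 */
static void tailcall_poke_transition(struct bpf_jit_poke_descriptor *poke,
				     void *old_target, void *new_target)
{
	if (new_target) {
		/* Make the target jmp valid first ... */
		WARN_ON_ONCE(bpf_arch_text_poke(poke->tailcall_target,
						BPF_MOD_JUMP,
						old_target, new_target));
		/* ... and, if the slot used to be empty, turn the bypass
		 * jmp into a nop so the pops and the target jmp start
		 * being executed.
		 */
		if (!old_target)
			WARN_ON_ONCE(bpf_arch_text_poke(poke->tailcall_bypass,
							BPF_MOD_JUMP,
							poke->bypass_addr,
							NULL));
	} else if (old_target) {
		/* Removing the target: re-arm the bypass (nop -> jmp)
		 * first, and only then wipe the target jmp.  A real
		 * implementation would also have to make sure the first
		 * poke is globally visible before the second one.
		 */
		WARN_ON_ONCE(bpf_arch_text_poke(poke->tailcall_bypass,
						BPF_MOD_JUMP,
						NULL, poke->bypass_addr));
		WARN_ON_ONCE(bpf_arch_text_poke(poke->tailcall_target,
						BPF_MOD_JUMP,
						old_target, NULL));
	}
}

Either way at least one of the two jmps stays valid at every point,
which is the whole idea: a program racing through this code either
takes the bypass or the tailcall, never a half-patched sequence.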