On Fri, Jul 17, 2020 at 01:06:07AM +0200, Daniel Borkmann wrote:
> On 7/16/20 1:36 AM, Maciej Fijalkowski wrote:
> > This commit serves two things:
> > 1) it optimizes BPF prologue/epilogue generation
> > 2) it makes it possible to have tail calls within BPF subprograms
> >
> > Both points are related to each other, since without 1), 2) could not
> > be achieved.
> >
> > In [1], Alexei says:
> > "The prologue will look like:
> > nop5
> > xor eax,eax  // two new bytes if bpf_tail_call() is used in this
> >              // function
> > push rbp
> > mov rbp, rsp
> > sub rsp, rounded_stack_depth
> > push rax // zero init tail_call counter
> > variable number of push rbx,r13,r14,r15
> >
> > Then bpf_tail_call will pop variable number rbx,..
> > and final 'pop rax'
> > Then 'add rsp, size_of_current_stack_frame'
> > jmp to next function and skip over 'nop5; xor eax,eax; push rbp; mov
> > rbp, rsp'
> >
> > This way new function will set its own stack size and will init tail
> > call counter with whatever value the parent had.
> >
> > If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
> > Instead it would need to have 'nop2' in there."
> >
> > Implement that suggestion.
> >
> > Since the stack layout is changed, tail call counter handling can no
> > longer rely on popping it to rbx, as was done for the constant-prologue
> > case, where rbx would later be overwritten with the actual value that
> > had been pushed to the stack. Therefore, let's use one of the registers
> > (%rcx) that is considered volatile/caller-saved and pop the value of
> > the tail call counter into it in the epilogue.
> >
> > Drop the BUILD_BUG_ON in emit_prologue and in
> > emit_bpf_tail_call_indirect, where the instruction layout is no longer
> > constant.
> >
> > Introduce a new poke target, 'tailcall_bypass', to the poke descriptor.
> > It is dedicated to skipping the register pops and stack unwind that are
> > generated right before the actual jump to the target program. Also,
> > reflect the actual purpose of poke->ip and rename it to
> > poke->tailcall_target so that it will not be confused with the poke
> > target that is being introduced here.
> >
> > For the case when the target program is not present, the BPF program
> > will skip the pop instructions and the nop5 dedicated for the
> > jmpq $target. An example of such a state, when only R6 of the
> > callee-saved registers is used by the program:
> >
> > ffffffffc0513aa1: e9 0e 00 00 00       jmpq  0xffffffffc0513ab4
> > ffffffffc0513aa6: 5b                   pop   %rbx
> > ffffffffc0513aa7: 58                   pop   %rax
> > ffffffffc0513aa8: 48 81 c4 00 00 00 00 add   $0x0,%rsp
> > ffffffffc0513aaf: 0f 1f 44 00 00       nopl  0x0(%rax,%rax,1)
> > ffffffffc0513ab4: 48 89 df             mov   %rbx,%rdi
> >
> > When the target program is inserted, the jump that was there to skip
> > the pops/nop5 will become a nop5, so the CPU will go over the pops and
> > do the actual tail call.
> >
> > One might ask why the pushes simply can not be placed after the nop5.
> > Consider the following example snippet:
> >
> > ffffffffc037030c: 48 89 fb             mov   %rdi,%rbx
> > (...)
> > ffffffffc0370332: 5b                   pop   %rbx
> > ffffffffc0370333: 58                   pop   %rax
> > ffffffffc0370334: 48 81 c4 00 00 00 00 add   $0x0,%rsp
> > ffffffffc037033b: 0f 1f 44 00 00       nopl  0x0(%rax,%rax,1)
> > ffffffffc0370340: 48 81 ec 00 00 00 00 sub   $0x0,%rsp
> > ffffffffc0370347: 50                   push  %rax
> > ffffffffc0370348: 53                   push  %rbx
> > ffffffffc0370349: 48 89 df             mov   %rbx,%rdi
> > ffffffffc037034c: e8 f7 21 00 00       callq 0xffffffffc0372548
> >
> > There is a bpf2bpf call (at ffffffffc037034c) right after the tail call
> > and the jump target is not present. ctx is in the %rbx register and the
> > BPF subprogram that we will call into at ffffffffc037034c relies on it,
> > e.g. it will pick up ctx from there. Such a code layout is therefore
> > broken, as we would overwrite the content of %rbx with the value that
> > was pushed in the prologue. That is the reason for the 'bypass'
> > approach.
> >
> > Special care needs to be taken during the install/update/remove of the
> > tailcall target. For the case when the target program is not present,
> > the CPU must not execute the pop instructions that precede the tail
> > call.
> >
> > To address that, the following states can be defined:
> > A nop, unwind, nop
> > B nop, unwind, tail
> > C skip, unwind, nop
> > D skip, unwind, tail
> >
> > A is forbidden (it leads to incorrectness). The state transitions
> > between tailcall install/update/remove will work as follows:
> >
> > First install tail call f: C->D->B(f)
> >  * poke the tailcall, after that get rid of the skip
> > Update tail call f to f': B(f)->B(f')
> >  * poke the tailcall (poke->tailcall_target) and do NOT touch the
> >    poke->tailcall_bypass
> > Remove tail call: B(f')->C(f')
> >  * poke->tailcall_bypass is poked back to a jump, then we wait one RCU
> >    grace period so that other programs finish their execution, and
> >    after that we are safe to remove the poke->tailcall_target
> > Install new tail call (f''): C(f')->D(f'')->B(f'')
> >  * same as the first step
> >
> > This way the CPU can never be exposed to the forbidden "nop, unwind,
> > nop" state.
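For easier reading, here are the four states above laid out as a table
over the two poked locations -- my reconstruction of the description, not
text from the patch. The middle "unwind" column (register pops plus stack
unwind) is always present:

         tailcall_bypass    unwind   tailcall_target
  A      nop5               pops     nop5              <- forbidden
  B      nop5               pops     jmp f
  C      jmp out (skip)     pops     nop5
  D      jmp out (skip)     pops     jmp f

In A the CPU would fall through the pops and stack unwind without taking
a tail call, clobbering callee-saved registers for the code that follows;
B, C and D are all safe.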
> > For regression checks, the 'tailcalls' kselftest was executed:
> > $ sudo ./test_progs -t tailcalls
> > #64/1 tailcall_1:OK
> > #64/2 tailcall_2:OK
> > #64/3 tailcall_3:OK
> > #64/4 tailcall_4:OK
> > #64/5 tailcall_5:OK
> > #64 tailcalls:OK
> > Summary: 1/5 PASSED, 0 SKIPPED, 0 FAILED
> >
> > Tail call related cases from the test_verifier kselftest are also
> > working fine. Sample BPF programs that utilize tail calls (sockex3,
> > tracex5) work properly as well.
> >
> > [1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@xxxxxxxxxxxxxxxxxxxxxxxxxxxx/
> >
> > Suggested-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@xxxxxxxxx>
> 
> Overall approach looks reasonable to me. The patch here could still be
> cleaned up a bit further, it is still very rough. Just minor comments
> below:

Thank you for spotting all of the issues. I will provide a v2 on Monday,
as I will be out of reach from today until Sunday.

> 
> > ---
> >  arch/x86/net/bpf_jit_comp.c | 241 +++++++++++++++++++++++++++---------
> >  include/linux/bpf.h         |   8 +-
> >  kernel/bpf/arraymap.c       |  61 +++++++--
> >  kernel/bpf/core.c           |   3 +-
> >  4 files changed, 239 insertions(+), 74 deletions(-)
> >
> [...]
> >  /*
> > - * Emit x86-64 prologue code for BPF program and check its size.
> > + * Emit x86-64 prologue code for BPF program.
> >   * bpf_tail_call helper will skip it while jumping into another program
> >   */
> > -static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
> > +static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
> > +			  bool tail_call)
> >  {
> >  	u8 *prog = *pprog;
> >  	int cnt = X86_PATCH_SIZE;
> > @@ -238,19 +269,16 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf)
> >  	 */
> >  	memcpy(prog, ideal_nops[NOP_ATOMIC5], cnt);
> >  	prog += cnt;
> > +	if (!ebpf_from_cbpf && tail_call)
> > +		EMIT2(0x31, 0xC0); /* xor eax, eax */
> > +	else
> > +		EMIT2(0x66, 0x90); /* nop2 */
> 
> nit: Why does the ebpf_from_cbpf need the extra nop?

Good catch, it doesn't need it :)

> 
> >  	EMIT1(0x55);             /* push rbp */
> >  	EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
> >  	/* sub rsp, rounded_stack_depth */
> >  	EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
> > -	EMIT1(0x53);             /* push rbx */
> > -	EMIT2(0x41, 0x55);       /* push r13 */
> > -	EMIT2(0x41, 0x56);       /* push r14 */
> > -	EMIT2(0x41, 0x57);       /* push r15 */
> > -	if (!ebpf_from_cbpf) {
> > -		/* zero init tail_call_cnt */
> > -		EMIT2(0x6a, 0x00);
> > -		BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
> > -	}
> > +	if (!ebpf_from_cbpf && tail_call)
> > +		EMIT1(0x50);     /* push rax */
> >  	*pprog = prog;
> >  }
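Putting the hunk above together: for a program that uses tail calls and
was not converted from cBPF, the emitted prologue under this patch comes
out roughly as follows. This is my annotated sketch; the 0x200 stack size
is made up for illustration, and the callee-saved pushes are emitted
elsewhere in the patch, outside the quoted hunks:

	nop5			/* X86_PATCH_SIZE bytes, patchable */
	xor eax, eax		/* nop2 instead if no tail calls */
	push rbp
	mov rbp, rsp
	sub rsp, 0x200		/* round_up(stack_depth, 8) */
	push rax		/* tail_call_cnt: zeroed here, or inherited */
	push rbx		/* ... plus r13-r15: only the regs used */

A tail call into such a program then lands directly on the sub rsp
instruction: it skips the nop5/xor/push rbp/mov, sets up its own stack
frame, and re-pushes the caller's rax, which is how tail_call_cnt is
carried across programs.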
> [...]
> > -static void emit_bpf_tail_call_indirect(u8 **pprog)
> > +static void emit_bpf_tail_call_indirect(u8 **pprog, bool *callee_regs_used,
> > +					u32 stack_depth)
> >  {
> >  	u8 *prog = *pprog;
> > -	int label1, label2, label3;
> > +	int pop_bytes = 0;
> > +	int off1 = 49;
> > +	int off2 = 38;
> > +	int off3 = 16;
> >  	int cnt = 0;
> >
> > +	/* count the additional bytes used for popping callee regs from stack
> > +	 * that need to be taken into account for each of the offsets that
> > +	 * are used for bailing out of the tail call
> > +	 */
> > +	pop_bytes = get_pop_bytes(callee_regs_used);
> > +	off1 += pop_bytes;
> > +	off2 += pop_bytes;
> > +	off3 += pop_bytes;
> > +
> >  	/*
> >  	 * rdi - pointer to ctx
> >  	 * rsi - pointer to bpf_array
> > @@ -370,72 +427,108 @@ static void emit_bpf_tail_call_indirect(u8 **pprog)
> >  	EMIT2(0x89, 0xD2);                        /* mov edx, edx */
> >  	EMIT3(0x39, 0x56,                         /* cmp dword ptr [rsi + 16], edx */
> >  	      offsetof(struct bpf_array, map.max_entries));
> > -#define OFFSET1 (41 + RETPOLINE_RAX_BPF_JIT_SIZE) /* Number of bytes to jump */
> > +#define OFFSET1 (off1 + RETPOLINE_RCX_BPF_JIT_SIZE) /* Number of bytes to jump */
> 
> The whole rename belongs into the first patch to avoid breaking bisectability
> as mentioned.

Ack.

> 
> >  	EMIT2(X86_JBE, OFFSET1);                  /* jbe out */
> > -	label1 = cnt;
> >
> >  	/*
> >  	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
> >  	 *	goto out;
> >  	 */
> > -	EMIT2_off32(0x8B, 0x85, -36 - MAX_BPF_STACK); /* mov eax, dword ptr [rbp - 548] */
> > +	EMIT2_off32(0x8B, 0x85 /* mov eax, dword ptr [rbp - (4 + sd)] */,
> > +		    -4 - round_up(stack_depth, 8));
> >  	EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT);     /* cmp eax, MAX_TAIL_CALL_CNT */
> > -#define OFFSET2 (30 + RETPOLINE_RAX_BPF_JIT_SIZE)
> > +#define OFFSET2 (off2 + RETPOLINE_RCX_BPF_JIT_SIZE)
> >  	EMIT2(X86_JA, OFFSET2);                   /* ja out */
> > -	label2 = cnt;
> >  	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
> > -	EMIT2_off32(0x89, 0x85, -36 - MAX_BPF_STACK); /* mov dword ptr [rbp -548], eax */
> > +	EMIT2_off32(0x89, 0x85,                   /* mov dword ptr [rbp - (4 + sd)], eax */
> > +		    -4 - round_up(stack_depth, 8));
> 
> nit: should probably sit in a var

Sure.

> 
> >  	/* prog = array->ptrs[index]; */
> > -	EMIT4_off32(0x48, 0x8B, 0x84, 0xD6,       /* mov rax, [rsi + rdx * 8 + offsetof(...)] */
> > +	EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6,       /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
> >  		    offsetof(struct bpf_array, ptrs));
> >
> >  	/*
> >  	 * if (prog == NULL)
> >  	 *	goto out;
> >  	 */
> > -	EMIT3(0x48, 0x85, 0xC0);                  /* test rax,rax */
> > -#define OFFSET3 (8 + RETPOLINE_RAX_BPF_JIT_SIZE)
> > -	EMIT2(X86_JE, OFFSET3);                   /* je out */
> > -	label3 = cnt;
> > +	EMIT3(0x48, 0x85, 0xC9);                  /* test rcx,rcx */
> > +#define OFFSET3 (off3 + RETPOLINE_RCX_BPF_JIT_SIZE)
> > +	EMIT2(X86_JE, OFFSET3);                   /* je out */
> [...]
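A side note for readers following along: get_pop_bytes() is part of this
patch but not visible in the quoted hunks. Judging from the push encodings
in emit_prologue() above (push rbx is one byte, r13-r15 carry a REX prefix
and take two), it presumably boils down to something like the sketch
below; the callee_regs_used[] index-to-register mapping is my assumption:

static int get_pop_bytes(bool *callee_regs_used)
{
	int bytes = 0;

	if (callee_regs_used[0])	/* rbx: pop %rbx is 1 byte */
		bytes += 1;
	if (callee_regs_used[1])	/* r13: REX.B prefix + opcode, 2 bytes */
		bytes += 2;
	if (callee_regs_used[2])	/* r14 */
		bytes += 2;
	if (callee_regs_used[3])	/* r15 */
		bytes += 2;

	return bytes;
}

This is what makes off1/off2/off3 dynamic: every callee-saved register the
program actually uses lengthens the epilogue that the three bail-out jumps
have to hop over.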
> 
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index c67c88ad35f8..38897b9c7d61 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -651,14 +651,15 @@ enum bpf_jit_poke_reason {
> >
> >  /* Descriptor of pokes pointing /into/ the JITed image. */
> >  struct bpf_jit_poke_descriptor {
> > -	void *ip;
> > +	void *tailcall_target;
> > +	void *tailcall_bypass;
> >  	union {
> >  		struct {
> >  			struct bpf_map *map;
> >  			u32 key;
> >  		} tail_call;
> >  	};
> > -	bool ip_stable;
> > +	bool tailcall_target_stable;
> 
> Probably makes sense to split off the pure rename into a separate patch to
> reduce this one slightly.

I was thinking of that as well. I will pull it out as you're suggesting.

> 
> >  	u8 adj_off;
> >  	u16 reason;
> >  };
> > @@ -1775,6 +1776,9 @@ enum bpf_text_poke_type {
> >  	BPF_MOD_JUMP,
> >  };
> >
> > +/* Number of bytes emit_patch() needs to generate instructions */
> > +#define X86_PATCH_SIZE	5
> 
> nit: this is arch specific, so should not be exposed in here, nor in
> arraymap.c below

Okay, so I think that we should add another member to the poke descriptor
that will hold specifically the bypass address, so that we wouldn't have
to calculate it in here. And I think this extension should go into this
patch, whereas the renaming should go into a separate one.
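For orientation, this is how the two poke sites and the computed bypass
destination line up with the first disassembly from the commit message
(the annotations are mine, not part of the patch):

  ffffffffc0513aa1: jmpq 0xffffffffc0513ab4  <- poke->tailcall_bypass (jmp or nop5)
  ffffffffc0513aa6: pop  %rbx                <- unwind: used callee-saved regs
  ffffffffc0513aa7: pop  %rax                <- unwind: tail call counter
  ffffffffc0513aa8: add  $0x0,%rsp           <- unwind: stack frame
  ffffffffc0513aaf: nopl ...                 <- poke->tailcall_target (nop5 or jmp prog)
  ffffffffc0513ab4: ...                      <- bypass destination
                                                = tailcall_target + X86_PATCH_SIZE

That last equality is exactly the bypass_addr calculation in
prog_array_map_poke_run() below; storing the address in the descriptor
would keep X86_PATCH_SIZE out of the generic code.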
> 
> >  int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
> >  		       void *addr1, void *addr2);
> >
> > diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> > index c66e8273fccd..d15729a3f46c 100644
> > --- a/kernel/bpf/arraymap.c
> > +++ b/kernel/bpf/arraymap.c
> > @@ -750,6 +750,7 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
> >  				    struct bpf_prog *old,
> >  				    struct bpf_prog *new)
> >  {
> > +	u8 *bypass_addr, *old_addr, *new_addr;
> >  	struct prog_poke_elem *elem;
> >  	struct bpf_array_aux *aux;
> >
> > @@ -770,12 +771,13 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
> >  		 * there could be danger of use after free otherwise.
> >  		 * 2) Initially when we start tracking aux, the program
> >  		 *    is not JITed yet and also does not have a kallsyms
> > -		 *    entry. We skip these as poke->ip_stable is not
> > -		 *    active yet. The JIT will do the final fixup before
> > -		 *    setting it stable. The various poke->ip_stable are
> > -		 *    successively activated, so tail call updates can
> > -		 *    arrive from here while JIT is still finishing its
> > -		 *    final fixup for non-activated poke entries.
> > +		 *    entry. We skip these as poke->tailcall_target_stable
> > +		 *    is not active yet. The JIT will do the final fixup
> > +		 *    before setting it stable. The various
> > +		 *    poke->tailcall_target_stable are successively activated,
> > +		 *    so tail call updates can arrive from here while JIT
> > +		 *    is still finishing its final fixup for non-activated
> > +		 *    poke entries.
> >  		 * 3) On program teardown, the program's kallsym entry gets
> >  		 *    removed out of RCU callback, but we can only untrack
> >  		 *    from sleepable context, therefore bpf_arch_text_poke()
> > @@ -792,20 +794,53 @@ static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
> >  		 * 5) Any other error happening below from bpf_arch_text_poke()
> >  		 *    is a unexpected bug.
> >  		 */
> > -		if (!READ_ONCE(poke->ip_stable))
> > +		if (!READ_ONCE(poke->tailcall_target_stable))
> >  			continue;
> >  		if (poke->reason != BPF_POKE_REASON_TAIL_CALL)
> >  			continue;
> >  		if (poke->tail_call.map != map ||
> >  		    poke->tail_call.key != key)
> >  			continue;
> > +		/* protect against un-updated poke descriptors since
> > +		 * we could fill them from subprog and the same desc
> > +		 * is present on main's program poke tab
> > +		 */
> > +		if (!poke->tailcall_bypass || !poke->tailcall_target)
> > +			continue;
> 
> Can't we avoid copying these descriptors over to the subprog in the first place?

I think we can, but can we consider that as something we will do as a
follow-up?

> 
> > +		if (!old && !new)
> > +			continue;
> 
> Could we avoid this above but instead signal via bpf_arch_text_poke() that nothing
> had to be patched? Reason is that bpf_arch_text_poke() will still do the sanity
> check to make sure reality meets expectation wrt current insns (which is also
> why I didn't add this skip). In that case we could then just avoid the expensive
> synchronize_rcu().

I was even thinking of having such a check before walking through the poke
descriptors, so that's the opposite of what you suggest. If you insist, I
can play with this a bit on Monday, but I recall that it was the only thing
stopping Alexei's pseudo-code from being fully functional (the nop->nop
update).

> 
> > -		ret = bpf_arch_text_poke(poke->ip, BPF_MOD_JUMP,
> > -					 old ? (u8 *)old->bpf_func +
> > -					 poke->adj_off : NULL,
> > -					 new ? (u8 *)new->bpf_func +
> > -					 poke->adj_off : NULL);
> > -		BUG_ON(ret < 0 && ret != -EINVAL);
> > +		bypass_addr = (u8 *)poke->tailcall_target + X86_PATCH_SIZE;
> > +		old_addr = old ? (u8 *)old->bpf_func + poke->adj_off : NULL;
> > +		new_addr = new ? (u8 *)new->bpf_func + poke->adj_off : NULL;
> > +
> > +		if (new) {
> > +			ret = bpf_arch_text_poke(poke->tailcall_target,
> > +						 BPF_MOD_JUMP,
> > +						 old_addr, new_addr);
> > +			BUG_ON(ret < 0 && ret != -EINVAL);
> > +			if (!old) {
> > +				ret = bpf_arch_text_poke(poke->tailcall_bypass,
> > +							 BPF_MOD_JUMP,
> > +							 bypass_addr, NULL);
> > +				BUG_ON(ret < 0 && ret != -EINVAL);
> > +			}
> > +		} else {
> > +			ret = bpf_arch_text_poke(poke->tailcall_bypass,
> > +						 BPF_MOD_JUMP,
> > +						 NULL, bypass_addr);
> > +			BUG_ON(ret < 0 && ret != -EINVAL);
> > +			/* let other CPUs finish the execution of program
> > +			 * so that it will not be possible to expose them
> > +			 * to invalid nop, stack unwind, nop state
> > +			 */
> > +			synchronize_rcu();
> 
> Very heavyweight that we need to potentially call this /multiple/ times for just a
> /single/ map update under poke mutex even ... but agree it's needed here to avoid
> racing. :(
> 
> > +			ret = bpf_arch_text_poke(poke->tailcall_target,
> > +						 BPF_MOD_JUMP,
> > +						 old_addr, NULL);
> > +			BUG_ON(ret < 0 && ret != -EINVAL);
> > +		}
> >  	}
> >  }
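To tie the branches above back to the A/B/C/D states from the commit
message (my summary, not part of the patch):

  new && old  : B(old) -> B(new)         single poke of tailcall_target
  new && !old : C -> D(new) -> B(new)    install tailcall_target first,
                                         then nop out the bypass
  old && !new : B(old) -> D(old) -> C    re-arm the bypass jump,
                                         synchronize_rcu(), then nop out
                                         tailcall_target

The intermediate states always keep the bypass jump in place whenever the
pops could otherwise be followed by a nop5, so the forbidden state A is
never observable by a running CPU.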