On Sat, Jun 15, 2024 at 10:52 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
>
>
> On 6/13/24 5:30 PM, Alexei Starovoitov wrote:
> > On Sun, Jun 9, 2024 at 10:18 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
> > I think "shadow stack" already has at least two different meanings
> > in the kernel.
> > Let's avoid adding 3rd.
> > How about "divided stack" ?
>
> Naming is hard. Maybe "private stack" which suggests the stack is private
> to that program?

I like it. "private stack" fits the best.

> >
> >> +static void emit_percpu_shadow_frame_ptr(u8 **pprog, void *shadow_frame_ptr)
> >> +{
> >> +        u8 *prog = *pprog;
> >> +
> >> +        /* movabs r9, shadow_frame_ptr */
> >> +        emit_mov_imm32(&prog, false, X86_REG_R9, (u32) (long) shadow_frame_ptr);
> >> +
> >> +        /* add <r9>, gs:[<off>] */
> >> +        EMIT2(0x65, 0x4c);
> >> +        EMIT3(0x03, 0x0c, 0x25);
> >> +        EMIT((u32)(unsigned long)&this_cpu_off, 4);
> > I think this can be one insn:
> > lea r9, gs:[(u32)shadow_frame_ptr]
>
> Apparently, __alloc_percpu_gfp() may return a pointer which is beyond 32bit. That is why my
> RFC patch failed CI. I later tried to use
>
> +        /* movabs r9, shadow_frame_ptr */
> +        emit_mov_imm64(&prog, X86_REG_R9, (long) shadow_frame_ptr >> 32,
> +                       (u32) (long) shadow_frame_ptr);
>
> and CI is successful. I did an on-demand test (https://github.com/kernel-patches/bpf/pull/7179)
> and it succeeded with CI.
>
> If __alloc_percpu_gfp() returns a pointer beyond 32bit, I am not sure
> whether we could get r9 with a single insn.

I see. Ok. Let's keep the two-insn sequence.

>
> >
> >> +        if (stack_depth && enable_shadow_stack) {
> > I think enabling it for progs with small stack usage
> > is unnecessary.
> > The definition of "small" is complicated.
> > I feel stack_depth <= 64 can stay as-is and
> > all networking progs don't have to use it either,
> > since they're called from known places.
> > While tracing progs can be anywhere, so I'd enable
> > divided stack for
> > stack_depth > 64 && prog_type == kprobe, tp, raw_tp, tracing, perf_event.
>
> This does make sense. It partially aligns with what I think on the prog type
> side. We only need to enable 'divided stack' for certain prog types.
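Right. Something along these lines is roughly what I mean. Sketch only: the
helper name and the exact set of prog types below are illustrative, not from
the patch; the "if (stack_depth && enable_shadow_stack)" check above would
then key off it:

/* sketch: gate the private stack on stack size and prog type */
static bool bpf_jit_use_private_stack(struct bpf_prog *prog)
{
        /* small stacks keep using the normal kernel stack */
        if (prog->aux->stack_depth <= 64)
                return false;

        /* only progs that can attach almost anywhere need it */
        switch (prog->type) {
        case BPF_PROG_TYPE_KPROBE:
        case BPF_PROG_TYPE_TRACEPOINT:
        case BPF_PROG_TYPE_RAW_TRACEPOINT:
        case BPF_PROG_TYPE_TRACING:
        case BPF_PROG_TYPE_PERF_EVENT:
                return true;
        default:
                return false;
        }
}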
> >
> >> +                if (bpf_prog->percpu_shadow_stack_ptr) {
> >> +                        percpu_shadow_stack_ptr = bpf_prog->percpu_shadow_stack_ptr;
> >> +                } else {
> >> +                        percpu_shadow_stack_ptr = __alloc_percpu_gfp(stack_depth, 8, GFP_KERNEL);
> >> +                        if (!percpu_shadow_stack_ptr)
> >> +                                return -ENOMEM;
> >> +                        bpf_prog->percpu_shadow_stack_ptr = percpu_shadow_stack_ptr;
> >> +                }
> >> +                shadow_frame_ptr = percpu_shadow_stack_ptr + round_up(stack_depth, 8);
> >> +                stack_depth = 0;
> >> +        } else {
> >> +                enable_shadow_stack = 0;
> >> +        }
> >> +
> >>          arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
> >>          user_vm_start = bpf_arena_get_user_vm_start(bpf_prog->aux->arena);
> >>
> >> @@ -1342,7 +1377,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
> >>          /* tail call's presence in current prog implies it is reachable */
> >>          tail_call_reachable |= tail_call_seen;
> >>
> >> -        emit_prologue(&prog, bpf_prog->aux->stack_depth,
> >> +        emit_prologue(&prog, stack_depth,
> >>                        bpf_prog_was_classic(bpf_prog), tail_call_reachable,
> >>                        bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb);
> >>          /* Exception callback will clobber callee regs for its own use, and
> >> @@ -1364,6 +1399,9 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
> >>                  emit_mov_imm64(&prog, X86_REG_R12,
> >>                                 arena_vm_start >> 32, (u32) arena_vm_start);
> >>
> >> +        if (enable_shadow_stack)
> >> +                emit_percpu_shadow_frame_ptr(&prog, shadow_frame_ptr);
> >> +
> >>          ilen = prog - temp;
> >>          if (rw_image)
> >>                  memcpy(rw_image + proglen, temp, ilen);
> >> @@ -1383,6 +1421,14 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
> >>                  u8 *func;
> >>                  int nops;
> >>
> >> +                if (enable_shadow_stack) {
> >> +                        if (src_reg == BPF_REG_FP)
> >> +                                src_reg = X86_REG_R9;
> >> +
> >> +                        if (dst_reg == BPF_REG_FP)
> >> +                                dst_reg = X86_REG_R9;
> > the verifier will reject a prog that attempts to write into R10.
> > So the above shouldn't be necessary.
>
> Actually there is at least one exception, e.g.,
>    if r10 > r5 goto +5
> where dst is r10 and src r5.

Good point.

We even have such a selftest to make sure it's rejected in unpriv.

SEC("socket")
__description("unpriv: cmp of frame pointer")
__success __failure_unpriv __msg_unpriv("R10 pointer comparison")
__retval(0)
__naked void unpriv_cmp_of_frame_pointer(void)
{
        asm volatile ("                                 \
        if r10 == 0 goto l0_%=;                         \
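And with the remap above, a read like that simply goes through r9 instead of
rbp. Roughly (illustrative only, not the exact emitted code; r5 lives in r8
in the x86 jit):

        if r10 > r5 goto +5     /* bpf insn */
        cmp r9, r8              /* instead of cmp rbp, r8 */
        ja <target>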
> >
> >> +                }
> >> +
> >>                  switch (insn->code) {
> >>                          /* ALU */
> >>                  case BPF_ALU | BPF_ADD | BPF_X:
> >> @@ -2014,6 +2060,7 @@ st:                     if (is_imm8(insn->off))
> >>                          emit_mov_reg(&prog, is64, real_src_reg, BPF_REG_0);
> >>                          /* Restore R0 after clobbering RAX */
> >>                          emit_mov_reg(&prog, true, BPF_REG_0, BPF_REG_AX);
> >> +
> >>                          break;
> >>                  }
> >>
> >> @@ -2038,14 +2085,20 @@ st:                     if (is_imm8(insn->off))
> >>
> >>                          func = (u8 *) __bpf_call_base + imm32;
> >>                          if (tail_call_reachable) {
> >> -                                RESTORE_TAIL_CALL_CNT(bpf_prog->aux->stack_depth);
> >> +                                RESTORE_TAIL_CALL_CNT(stack_depth);
> >>                                  ip += 7;
> >>                          }
> >>                          if (!imm32)
> >>                                  return -EINVAL;
> >> +                        if (enable_shadow_stack) {
> >> +                                EMIT2(0x41, 0x51);
> >> +                                ip += 2;
> >> +                        }
> >>                          ip += x86_call_depth_emit_accounting(&prog, func, ip);
> >>                          if (emit_call(&prog, func, ip))
> >>                                  return -EINVAL;
> >> +                        if (enable_shadow_stack)
> >> +                                EMIT2(0x41, 0x59);
> > push/pop around calls are load/store plus math on %rsp.
> > I think it's cheaper to reload r9 after the call with
> > a single insn.
> > The reload of r9 is effectively gs+const.
> > There is no memory access. So it should be faster.
>
> Two insns may be necessary since __alloc_percpu_gfp()
> may return a pointer beyond 32 bits.
>
> >
> > Technically we can replace all uses of R10==rbp with
> > 'gs:' based instructions.
> > Like:
> > r1 = r10
> > can be jitted into
> > lea rdi, gs + (u32)shadow_frame_ptr
> >
> > and r0 = *(u32 *)(r10 - 64)
> > can be jitted into:
> > mov rax, dword ptr gs:[(u32)shadow_frame_ptr - 64]
> >
> > but that is probably a bunch of jit changes.
> > So I'd start with a simple reload of r9 after each call.
>
> This is a good idea. We might need this so that we only have
> one extra insn per call.

Since the reload of r9 is a two-insn sequence (a 64-bit mov immediate plus
a load from gs:this_cpu_off), I suspect push/pop r9 might be faster.
So I'd stick to what you have already.

Interesting, though, that static per-cpu vars have 32-bit pointers,
but dynamic per-cpu alloc returns a full 64-bit pointer? hmm.
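FWIW, spelled out, the reload we would be re-emitting after every call is the
same two-insn sequence as the prologue, i.e. roughly (just the hunks from
above combined, sketch only):

        /* reload r9 = this cpu's private stack frame pointer */
        /* movabs r9, shadow_frame_ptr (full 64-bit immediate) */
        emit_mov_imm64(&prog, X86_REG_R9, (long) shadow_frame_ptr >> 32,
                       (u32) (long) shadow_frame_ptr);
        /* add r9, gs:[&this_cpu_off] */
        EMIT2(0x65, 0x4c);
        EMIT3(0x03, 0x0c, 0x25);
        EMIT((u32)(unsigned long)&this_cpu_off, 4);

vs the push r9 / pop r9 (EMIT2(0x41, 0x51) / EMIT2(0x41, 0x59)) around the
call that the patch already emits, which is why sticking with push/pop looks
like the better trade-off once the immediate doesn't fit in 32 bits.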