Re: [PATCH bpf-next v11 4/7] bpf, x86: Support private stack in jit

On Mon, Nov 11, 2024 at 3:18 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
>
>
>
>
> On 11/9/24 12:14 PM, Alexei Starovoitov wrote:
> > On Fri, Nov 8, 2024 at 6:53 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
> >>
> >>          stack_depth = bpf_prog->aux->stack_depth;
> >> +       if (bpf_prog->aux->priv_stack_ptr) {
> >> +               priv_frame_ptr = bpf_prog->aux->priv_stack_ptr + round_up(stack_depth, 16);
> >> +               stack_depth = 0;
> >> +       }
> > ...
> >
> >> +       priv_stack_ptr = prog->aux->priv_stack_ptr;
> >> +       if (!priv_stack_ptr && prog->aux->jits_use_priv_stack) {
> >> +               priv_stack_ptr = __alloc_percpu_gfp(prog->aux->stack_depth, 16, GFP_KERNEL);
> > After applying I started to see crashes running test_progs -j like:
> >
> > [  173.465191] Oops: general protection fault, probably for
> > non-canonical address 0xdffffc0000000af9: 0000 [#1] PREEMPT SMP KASAN
> > [  173.466053] KASAN: probably user-memory-access in range
> > [0x00000000000057c8-0x00000000000057cf]
> > [  173.466053] RIP: 0010:dst_dev_put+0x1e/0x220
> > [  173.466053] Call Trace:
> > [  173.466053]  <IRQ>
> > [  173.466053]  ? die_addr+0x40/0xa0
> > [  173.466053]  ? exc_general_protection+0x138/0x1f0
> > [  173.466053]  ? asm_exc_general_protection+0x26/0x30
> > [  173.466053]  ? dst_dev_put+0x1e/0x220
> > [  173.466053]  rt_fibinfo_free_cpus.part.0+0x8c/0x130
> > [  173.466053]  fib_nh_common_release+0xd6/0x2a0
> > [  173.466053]  free_fib_info_rcu+0x129/0x360
> > [  173.466053]  ? rcu_core+0xa55/0x1340
> > [  173.466053]  rcu_core+0xa55/0x1340
> > [  173.466053]  ? rcutree_report_cpu_dead+0x380/0x380
> > [  173.466053]  ? hrtimer_interrupt+0x319/0x7c0
> > [  173.466053]  handle_softirqs+0x14c/0x4d0
> >
> > [   35.134115] Oops: general protection fault, probably for
> > non-canonical address 0xe0000bfff101fbbc: 0000 [#1] PREEMPT SMP KASAN
> > [   35.135089] KASAN: probably user-memory-access in range
> > [0x00007fff880fdde0-0x00007fff880fdde7]
> > [   35.135089] RIP: 0010:destroy_workqueue+0x4b4/0xa70
> > [   35.135089] Call Trace:
> > [   35.135089]  <TASK>
> > [   35.135089]  ? die_addr+0x40/0xa0
> > [   35.135089]  ? exc_general_protection+0x138/0x1f0
> > [   35.135089]  ? asm_exc_general_protection+0x26/0x30
> > [   35.135089]  ? destroy_workqueue+0x4b4/0xa70
> > [   35.135089]  ? destroy_workqueue+0x592/0xa70
> > [   35.135089]  ? __mutex_unlock_slowpath.isra.0+0x270/0x270
> > [   35.135089]  ext4_put_super+0xff/0xd70
> > [   35.135089]  generic_shutdown_super+0x148/0x4c0
> > [   35.135089]  kill_block_super+0x3b/0x90
> > [   35.135089]  ext4_kill_sb+0x65/0x90
> >
> > I think I see the bug... quoted it above...
> >
> > Please make sure you reproduce it first.
>
> Indeed, using the allocation size round_up(stack_depth, 16) for __alloc_percpu_gfp()
> fixed the problem.
>
> The following is an additional change on top of this patch set to
>    - fix the memory access bug as suggested by Alexei above
>    - add guard space for the private stack: an additional 16 bytes at the
>      end of the stack serve as the guard space. The content is prepopulated
>      and checked at the per-cpu private stack free site. If the content
>      is not as expected, a kernel message is emitted.
>    - add kasan support for the guard space.
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 55556a64f776..d796d419bb48 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -1446,6 +1446,9 @@ static void emit_priv_frame_ptr(u8 **pprog, void __percpu *priv_frame_ptr)
>   #define LOAD_TAIL_CALL_CNT_PTR(stack)                          \
>          __LOAD_TCC_PTR(BPF_TAIL_CALL_CNT_PTR_STACK_OFF(stack))
>
> +#define PRIV_STACK_GUARD_SZ    16
> +#define PRIV_STACK_GUARD_VAL   0xEB9F1234eb9f1234ULL
> +
>   static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image,
>                    int oldproglen, struct jit_context *ctx, bool jmp_padding)
>   {
> @@ -1462,10 +1465,11 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
>          u8 *prog = temp;
>          u32 stack_depth;
>          int err;
> +       // int stack_size;
>
>          stack_depth = bpf_prog->aux->stack_depth;
>          if (bpf_prog->aux->priv_stack_ptr) {
> -               priv_frame_ptr = bpf_prog->aux->priv_stack_ptr + round_up(stack_depth, 16);
> +               priv_frame_ptr = bpf_prog->aux->priv_stack_ptr + round_up(stack_depth, 16) + PRIV_STACK_GUARD_SZ;
>                  stack_depth = 0;

Since the priv stack is not really a stack, there is no need to align it to 16
bytes and no need to round_up() either.
Let's drop these parts; it will simplify the code.

Re: GUARD_SZ.
I think it's better to guard both top and bottom.
8 bytes each will do.

>          }
>
> @@ -1496,8 +1500,18 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
>                  emit_mov_imm64(&prog, X86_REG_R12,
>                                 arena_vm_start >> 32, (u32) arena_vm_start);
>
> -       if (priv_frame_ptr)
> +       if (priv_frame_ptr) {
> +#if 0
> +               /* hack to emit and write some data to guard area */
> +               emit_priv_frame_ptr(&prog, bpf_prog->aux->priv_stack_ptr);
> +
> +               /* See case BPF_ST | BPF_MEM | BPF_W */
> +               EMIT2(0x41, 0xC7);
> +               EMIT2(add_1reg(0x40, X86_REG_R9), 0);
> +               EMIT(0xFFFFFFFF, bpf_size_to_x86_bytes(BPF_W));
> +#endif
>                  emit_priv_frame_ptr(&prog, priv_frame_ptr);
> +       }
>
>          ilen = prog - temp;
>          if (rw_image)
> @@ -3383,11 +3397,13 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>          struct x64_jit_data *jit_data;
>          int proglen, oldproglen = 0;
>          struct jit_context ctx = {};
> +       int priv_stack_size, cpu;
>          bool tmp_blinded = false;
>          bool extra_pass = false;
>          bool padding = false;
>          u8 *rw_image = NULL;
>          u8 *image = NULL;
> +       u64 *guard_ptr;
>          int *addrs;
>          int pass;
>          int i;
> @@ -3418,11 +3434,17 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>          }
>          priv_stack_ptr = prog->aux->priv_stack_ptr;
>          if (!priv_stack_ptr && prog->aux->jits_use_priv_stack) {
> -               priv_stack_ptr = __alloc_percpu_gfp(prog->aux->stack_depth, 16, GFP_KERNEL);
> +               priv_stack_size = round_up(prog->aux->stack_depth, 16) + PRIV_STACK_GUARD_SZ;
> +               priv_stack_ptr = __alloc_percpu_gfp(priv_stack_size, 16, GFP_KERNEL);
>                  if (!priv_stack_ptr) {
>                          prog = orig_prog;
>                          goto out_priv_stack;
>                  }
> +               for_each_possible_cpu(cpu) {
> +                       guard_ptr = per_cpu_ptr(priv_stack_ptr, cpu);
> +                       guard_ptr[0] = guard_ptr[1] = PRIV_STACK_GUARD_VAL;
> +                       kasan_poison_vmalloc(guard_ptr, PRIV_STACK_GUARD_SZ);

With top and bottom guards there will be two calls to poison.

But did you check that percpu allocs come from the vmalloc region?
Does kasan_poison_vmalloc() actually work here, or does it silently nop?

> +               }
>                  prog->aux->priv_stack_ptr = priv_stack_ptr;
>          }
>          addrs = jit_data->addrs;
> @@ -3561,6 +3583,10 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>   out_addrs:
>                  kvfree(addrs);
>                  if (!image && priv_stack_ptr) {
> +                       for_each_possible_cpu(cpu) {
> +                               guard_ptr = per_cpu_ptr(priv_stack_ptr, cpu);
> +                               kasan_unpoison_vmalloc(guard_ptr, PRIV_STACK_GUARD_SZ, KASAN_VMALLOC_PROT_NORMAL);
> +                       }
>                          free_percpu(priv_stack_ptr);
>                          prog->aux->priv_stack_ptr = NULL;
>                  }
> @@ -3603,6 +3629,9 @@ void bpf_jit_free(struct bpf_prog *prog)
>          if (prog->jited) {
>                  struct x64_jit_data *jit_data = prog->aux->jit_data;
>                  struct bpf_binary_header *hdr;
> +               void __percpu *priv_stack_ptr;
> +               u64 *guard_ptr;
> +               int cpu;
>
>                  /*
>                   * If we fail the final pass of JIT (from jit_subprogs),
> @@ -3618,7 +3647,21 @@ void bpf_jit_free(struct bpf_prog *prog)
>                  prog->bpf_func = (void *)prog->bpf_func - cfi_get_offset();
>                  hdr = bpf_jit_binary_pack_hdr(prog);
>                  bpf_jit_binary_pack_free(hdr, NULL);
> -               free_percpu(prog->aux->priv_stack_ptr);
> +
> +               priv_stack_ptr = prog->aux->priv_stack_ptr;
> +               if (priv_stack_ptr) {
> +                       int stack_size;
> +
> +                       stack_size = round_up(prog->aux->stack_depth, 16) + PRIV_STACK_GUARD_SZ;
> +                       for_each_possible_cpu(cpu) {
> +                               guard_ptr = per_cpu_ptr(priv_stack_ptr, cpu);
> +                               kasan_unpoison_vmalloc(guard_ptr, PRIV_STACK_GUARD_SZ, KASAN_VMALLOC_PROT_NORMAL);
> +                               if (guard_ptr[0] != PRIV_STACK_GUARD_VAL || guard_ptr[1] != PRIV_STACK_GUARD_VAL)
> +                                       pr_err("Private stack Overflow happened for prog %sx\n", prog->aux->name);
> +                       }

A common helper is needed to check the guards before free_percpu().

> +                       free_percpu(priv_stack_ptr);
> +               }
> +
>                  WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(prog));
>          }
>
> This fixed the issue Alexei discovered.
>
> A 16-byte guard space is allocated since the allocation is 16-byte aligned
> with a multiple-of-16 size. If a bpf program overflows the stack (change '#if 0' to '#if 1'
> in the above change), we will see:
>
> [root@arch-fb-vm1 bpf]# ./test_progs -n 336
> [   28.447390] bpf_testmod: loading out-of-tree module taints kernel.
> [   28.448180] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
> #336/1   struct_ops_private_stack/private_stack:OK
> #336/2   struct_ops_private_stack/private_stack_fail:OK
> #336/3   struct_ops_private_stack/private_stack_recur:OK
> #336     struct_ops_private_stack:OK
> Summary: 1/3 PASSED, 0 SKIPPED, 0 FAILED
> [   28.737710] Private stack Overflow happened for prog Fx
> [   28.739284] Private stack Overflow happened for prog Fx
> [   28.968732] Private stack Overflow happened for prog Fx
>
> Here the func name is 'Fx' (representing the sub prog). We might need
> to add more info (e.g. the main prog name) to make the message more
> meaningful.

Probably worth introducing a helper like:

const char *bpf_get_prog_name(struct bpf_prog *prog)
{
  if (prog->aux->ksym.prog)
    return prog->aux->ksym.name;
  return prog->aux->name;
}


>
> I added some kasan-related changes. If I corrupt the guard space from the kernel (not from a bpf prog),
> kasan prints the error message properly. But unfortunately, in jit there is no
> kasan instrumentation, so the warning (with the "#if 1" change) is not reported. One possibility is
> that if the kernel config enables kasan, the bpf jit could add kasan instrumentation to the jited
> binary. I am not sure about the complexity and whether it is worthwhile, since the verifier should
> already prevent overflow and we already have the guard check ("Private stack Overflow happened ...")
> in jit.

As a follow-up we should teach the JIT to emit calls to __asan_loadN/storeN
when processing LDX/STX.
imo it's been long overdue.




