Re: [PATCH bpf] x86/bpf: handle bpf-program-triggered exceptions properly

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Jan 27, 2021 at 5:06 PM Yonghong Song <yhs@xxxxxx> wrote:
>
>
>
> On 1/27/21 1:47 PM, Andy Lutomirski wrote:
> > On Mon, Jan 25, 2021 at 4:12 PM Yonghong Song <yhs@xxxxxx> wrote:
> >>
> >> When reviewing patch ([1]), which adds a script to run bpf selftest
> >> through qemu at /sbin/init stage, I found the following kernel bug
> >> warning:
> >>
> >> [  112.118892] BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1351
> >> [  112.119805] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 354, name: new_name
> >> [  112.120512] 3 locks held by new_name/354:
> >> [  112.120868]  #0: ffff88800476e0a0 (&p->lock){+.+.}-{3:3}, at: bpf_seq_read+0x3a/0x3d0
> >> [  112.121573]  #1: ffffffff82d69800 (rcu_read_lock){....}-{1:2}, at: bpf_iter_run_prog+0x5/0x160
> >> [  112.122348]  #2: ffff8880061c2088 (&mm->mmap_lock#2){++++}-{3:3}, at: exc_page_fault+0x1a1/0x640
> >> [  112.123128] Preemption disabled at:
> >> [  112.123130] [<ffffffff8108f913>] migrate_disable+0x33/0x80
> >> [  112.123942] CPU: 0 PID: 354 Comm: new_name Tainted: G           O      5.11.0-rc4-00524-g6e66fbb
> >> 10597-dirty #1249
> >> [  112.124822] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01
> >> /2014
> >> [  112.125614] Call Trace:
> >> [  112.125835]  dump_stack+0x77/0x97
> >> [  112.126137]  ___might_sleep.cold.119+0xf2/0x106
> >> [  112.126537]  exc_page_fault+0x1c1/0x640
> >> [  112.126888]  asm_exc_page_fault+0x1e/0x30
> >> [  112.127241] RIP: 0010:bpf_prog_0a182df2d34af188_dump_bpf_prog+0xf5/0xb3c
> >> [  112.127825] Code: 00 00 8b 7d f4 41 8b 76 44 48 39 f7 73 06 48 01 fb 49 89 df 4c 89 7d d8 49 8b
> >> bd 20 01 00 00 48 89 7d e0 49 8b bd e0 00 00 00 <48> 8b 7f 20 48 01 d7 48 89 7d e8 48 89 e9 48 83 c
> >> 1 d0 48 8b 7d c8
> >> [  112.129433] RSP: 0018:ffffc9000035fdc8 EFLAGS: 00010282
> >> [  112.129895] RAX: 0000000000000000 RBX: ffff888005a49458 RCX: 0000000000000024
> >> [  112.130509] RDX: 00000000000002f0 RSI: 0000000000000509 RDI: 0000000000000000
> >> [  112.131126] RBP: ffffc9000035fe20 R08: 0000000000000001 R09: 0000000000000000
> >> [  112.131737] R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000400
> >> [  112.132355] R13: ffff888006085800 R14: ffff888004718540 R15: ffff888005a49458
> >> [  112.132990]  ? bpf_prog_0a182df2d34af188_dump_bpf_prog+0xc8/0xb3c
> >> [  112.133526]  bpf_iter_run_prog+0x75/0x160
> >> [  112.133880]  __bpf_prog_seq_show+0x39/0x40
> >> [  112.134258]  bpf_seq_read+0xf6/0x3d0
> >> [  112.134582]  vfs_read+0xa3/0x1b0
> >> [  112.134873]  ksys_read+0x4f/0xc0
> >> [  112.135166]  do_syscall_64+0x2d/0x40
> >> [  112.135482]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >>
> >> To reproduce the issue, with patch [1] and use the following script:
> >>    tools/testing/selftests/bpf/run_in_vm.sh -- cat /sys/fs/bpf/progs.debug
> >>
> >> The reason of the above kernel warning is due to bpf program
> >> tries to dereference an address of 0 and which is not caught
> >> by bpf exception handling logic.
> >>
> >> ...
> >> SEC("iter/bpf_prog")
> >> int dump_bpf_prog(struct bpf_iter__bpf_prog *ctx)
> >> {
> >>          struct bpf_prog *prog = ctx->prog;
> >>          struct bpf_prog_aux *aux;
> >>          ...
> >>          if (!prog)
> >>                  return 0;
> >>          aux = prog->aux;
> >>          ...
> >>          ... aux->dst_prog->aux->name ...
> >>          return 0;
> >> }
> >>
> >> If the aux->dst_prog is NULL pointer, a fault will happen when trying
> >> to access aux->dst_prog->aux.
> >>
> >
> > Which would be a bug in the bpf verifier, no?  This is a bpf program
> > that apparently passed verification, and it somehow dereferenced a
> > NULL pointer.
> >
> > Let's enumerate some cases.
> >
> > 1. x86-like architecture, SMAP enabled.  The CPU determines that this
> > is bogus, and bpf gets lucky, because the x86 code runs the exception
> > handler instead of summarily OOPSing.  IMO it would be entirely
> > reasonable to OOPS.
> >
> > 2 x86-like architecture, SMAP disabled.  This looks like a valid user
> > access, and for all bpf knows, 0 might be an actual mapped address,
> > and it might have userfaultfd attached, etc.  And it's called from a
> > non-sleepable context.  The warning is entirely correct.
> >
> > 3. s390x-like architecture.  This is a dereference of a bad kernel
> > pointer.  OOPS, unless you get lucky.
> >
> >
> > This patch is totally bogus.  You can't work around this by some magic
> > exception handling.  Unless I'm missing something big, this is a bpf
> > bug.  The following is not a valid kernel code pattern:
> >
> > label:
> >    dereference might-be-NULL kernel pointer
> > _EXHANDLER_...(label, ...); /* try to recover */
> >
> > This appears to have been introduced by:
> >
> > commit 3dec541b2e632d630fe7142ed44f0b3702ef1f8c
> > Author: Alexei Starovoitov <ast@xxxxxxxxxx>
> > Date:   Tue Oct 15 20:25:03 2019 -0700
> >
> >      bpf: Add support for BTF pointers to x86 JIT
> >
> > Alexei, this looks like a very long-winded and ultimately incorrect
> > way to remove the branch from:
> >
> > if (ptr)
> >    val = *ptr;
> >
> > Wouldn't it be better to either just emit the branch directly or to
> > make sure that the pointer is valid in the first place?
>
> Let me explain the motivation for this patch.
>
> Previously, for any kernel data structure access,
> people has to use bpf_probe_read() or bpf_probe_read_kernel()
> helper even most of these accesses are okay and will not
> fault. For example, for
>     int t = a->b->c->d
> three bpf_probe_read() will be needed, e.g.,
>     bpf_probe_read_kernel(&t1, sizeof(t1), &a->b);
>     bpf_probe_read_kernel(&t2, sizeof(t2), &t1->c);
>     bpf_probe_read_kernel(&t, sizeof(t), &t2->d);
>
> if there is a fault, bpf_probe_read_kernel() helper will
> suppress the exception and clears the dest buffer and
> return.
>
> The above usage of bpf_probe_read_kernel()
> is complicated and not C like and bpf developers
> does not like it.
>
> bcc (https://github.com/iovisor/bcc/) permits
> users to write "a->b->c->d" styles and then through
> clang rewriter to convert it to a series of
> bpf_probe_read_kernel()'s. But most users are
> directly using clang to compile their programs so
> they have to write bpf_probe_read_kernel()'s
> by themselves.
>
> The motivation here is to improve user experience so
> user can just write
>     int t = a->b->c->d
> some kernel will automatically take care of exceptions
> and maintain bpf_probe_read_kernel() semantics.
> So what the above patch essentially did is to check if the "regs->ip"
> is one of ips which try to a "bpf_probe_read_kernel()" (actually
> a direct access), it will fix up exception (clear the dest register)
> and returns.

Okay, so I guess you're trying to inline probe_read_kernel().  But
that means you have to inline a valid implementation.  In particular,
you need to check that you're accessing *kernel* memory.  Just like
how get_user() validates that the pointer points into user memory,
your helper should bounds check the pointer.  On x86, you could check
the high bit.

As an extra complication, we should really add logic to
get_kernel_nofault() to verify that the pointer points into actual
memory as opposed to MMIO space (or future incoherent MKTME space or
something like that, sigh).  This will severely complicate inlining
it.  And we should *really* make the same fix to get_kernel_nofault()
-- it should validate that the pointer is a kernel pointer.

Is this really worth inlining instead of having the BPF JIT generate
an out of line call to a real C function?  That would let us put in a
sane implementation.



[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux