Re: [PATCH bpf] x86/bpf: handle bpf-program-triggered exceptions properly

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 1/27/21 1:47 PM, Andy Lutomirski wrote:
On Mon, Jan 25, 2021 at 4:12 PM Yonghong Song <yhs@xxxxxx> wrote:

When reviewing patch ([1]), which adds a script to run bpf selftest
through qemu at /sbin/init stage, I found the following kernel bug
warning:

[  112.118892] BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1351
[  112.119805] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 354, name: new_name
[  112.120512] 3 locks held by new_name/354:
[  112.120868]  #0: ffff88800476e0a0 (&p->lock){+.+.}-{3:3}, at: bpf_seq_read+0x3a/0x3d0
[  112.121573]  #1: ffffffff82d69800 (rcu_read_lock){....}-{1:2}, at: bpf_iter_run_prog+0x5/0x160
[  112.122348]  #2: ffff8880061c2088 (&mm->mmap_lock#2){++++}-{3:3}, at: exc_page_fault+0x1a1/0x640
[  112.123128] Preemption disabled at:
[  112.123130] [<ffffffff8108f913>] migrate_disable+0x33/0x80
[  112.123942] CPU: 0 PID: 354 Comm: new_name Tainted: G           O      5.11.0-rc4-00524-g6e66fbb
10597-dirty #1249
[  112.124822] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01
/2014
[  112.125614] Call Trace:
[  112.125835]  dump_stack+0x77/0x97
[  112.126137]  ___might_sleep.cold.119+0xf2/0x106
[  112.126537]  exc_page_fault+0x1c1/0x640
[  112.126888]  asm_exc_page_fault+0x1e/0x30
[  112.127241] RIP: 0010:bpf_prog_0a182df2d34af188_dump_bpf_prog+0xf5/0xb3c
[  112.127825] Code: 00 00 8b 7d f4 41 8b 76 44 48 39 f7 73 06 48 01 fb 49 89 df 4c 89 7d d8 49 8b
bd 20 01 00 00 48 89 7d e0 49 8b bd e0 00 00 00 <48> 8b 7f 20 48 01 d7 48 89 7d e8 48 89 e9 48 83 c
1 d0 48 8b 7d c8
[  112.129433] RSP: 0018:ffffc9000035fdc8 EFLAGS: 00010282
[  112.129895] RAX: 0000000000000000 RBX: ffff888005a49458 RCX: 0000000000000024
[  112.130509] RDX: 00000000000002f0 RSI: 0000000000000509 RDI: 0000000000000000
[  112.131126] RBP: ffffc9000035fe20 R08: 0000000000000001 R09: 0000000000000000
[  112.131737] R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000400
[  112.132355] R13: ffff888006085800 R14: ffff888004718540 R15: ffff888005a49458
[  112.132990]  ? bpf_prog_0a182df2d34af188_dump_bpf_prog+0xc8/0xb3c
[  112.133526]  bpf_iter_run_prog+0x75/0x160
[  112.133880]  __bpf_prog_seq_show+0x39/0x40
[  112.134258]  bpf_seq_read+0xf6/0x3d0
[  112.134582]  vfs_read+0xa3/0x1b0
[  112.134873]  ksys_read+0x4f/0xc0
[  112.135166]  do_syscall_64+0x2d/0x40
[  112.135482]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

To reproduce the issue, with patch [1] and use the following script:
   tools/testing/selftests/bpf/run_in_vm.sh -- cat /sys/fs/bpf/progs.debug

The reason of the above kernel warning is due to bpf program
tries to dereference an address of 0 and which is not caught
by bpf exception handling logic.

...
SEC("iter/bpf_prog")
int dump_bpf_prog(struct bpf_iter__bpf_prog *ctx)
{
         struct bpf_prog *prog = ctx->prog;
         struct bpf_prog_aux *aux;
         ...
         if (!prog)
                 return 0;
         aux = prog->aux;
         ...
         ... aux->dst_prog->aux->name ...
         return 0;
}

If the aux->dst_prog is NULL pointer, a fault will happen when trying
to access aux->dst_prog->aux.


Which would be a bug in the bpf verifier, no?  This is a bpf program
that apparently passed verification, and it somehow dereferenced a
NULL pointer.

Let's enumerate some cases.

1. x86-like architecture, SMAP enabled.  The CPU determines that this
is bogus, and bpf gets lucky, because the x86 code runs the exception
handler instead of summarily OOPSing.  IMO it would be entirely
reasonable to OOPS.

2 x86-like architecture, SMAP disabled.  This looks like a valid user
access, and for all bpf knows, 0 might be an actual mapped address,
and it might have userfaultfd attached, etc.  And it's called from a
non-sleepable context.  The warning is entirely correct.

3. s390x-like architecture.  This is a dereference of a bad kernel
pointer.  OOPS, unless you get lucky.


This patch is totally bogus.  You can't work around this by some magic
exception handling.  Unless I'm missing something big, this is a bpf
bug.  The following is not a valid kernel code pattern:

label:
   dereference might-be-NULL kernel pointer
_EXHANDLER_...(label, ...); /* try to recover */

This appears to have been introduced by:

commit 3dec541b2e632d630fe7142ed44f0b3702ef1f8c
Author: Alexei Starovoitov <ast@xxxxxxxxxx>
Date:   Tue Oct 15 20:25:03 2019 -0700

     bpf: Add support for BTF pointers to x86 JIT

Alexei, this looks like a very long-winded and ultimately incorrect
way to remove the branch from:

if (ptr)
   val = *ptr;

Wouldn't it be better to either just emit the branch directly or to
make sure that the pointer is valid in the first place?

Let me explain the motivation for this patch.

Previously, for any kernel data structure access,
people has to use bpf_probe_read() or bpf_probe_read_kernel()
helper even most of these accesses are okay and will not
fault. For example, for
   int t = a->b->c->d
three bpf_probe_read() will be needed, e.g.,
   bpf_probe_read_kernel(&t1, sizeof(t1), &a->b);
   bpf_probe_read_kernel(&t2, sizeof(t2), &t1->c);
   bpf_probe_read_kernel(&t, sizeof(t), &t2->d);

if there is a fault, bpf_probe_read_kernel() helper will
suppress the exception and clears the dest buffer and
return.

The above usage of bpf_probe_read_kernel()
is complicated and not C like and bpf developers
does not like it.

bcc (https://github.com/iovisor/bcc/) permits
users to write "a->b->c->d" styles and then through
clang rewriter to convert it to a series of
bpf_probe_read_kernel()'s. But most users are
directly using clang to compile their programs so
they have to write bpf_probe_read_kernel()'s
by themselves.

The motivation here is to improve user experience so
user can just write
   int t = a->b->c->d
some kernel will automatically take care of exceptions
and maintain bpf_probe_read_kernel() semantics.
So what the above patch essentially did is to check if the "regs->ip"
is one of ips which try to a "bpf_probe_read_kernel()" (actually
a direct access), it will fix up exception (clear the dest register)
and returns.

For a->b->c->d, some users may add "if (ptr) check"
for some of them and if that is the
case, compiler/verifier will honor that. Some users
may not add if they are certain from code that in most
or all cases, pointer will not be null. From verifier
perspective, it will be hard to decide whether
a->b, a->b->c is null or not, adding null checks
to every kernel de-references might be excessive. Also,
not 100% sure whether with null check, the pointer
dereference will absolutely not produce fault or not.
If there is no such guarantee then bpf_probe_read_kernel()
will be still needed.

I see page fault handler specially processed page fault for
kprobe and vsyscall. maybe page faults generated by special
bpf insns (simulating bpf_probe_read_kernel()) can also
be specially processed here.


--Andy




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux