Re: [PATCH bpf] x86/bpf: handle bpf-program-triggered exceptions properly

Andy Lutomirski <luto@xxxxxxxxxxxxxx> · Fri, 29 Jan 2021 15:30:39 -0800

On Fri, Jan 29, 2021 at 3:12 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Fri, Jan 29, 2021 at 08:00:57AM -0800, Andy Lutomirski wrote:
> >
> > > On Jan 29, 2021, at 1:21 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > >
> > > On Thu, Jan 28, 2021 at 04:32:34PM -0800, Andy Lutomirski wrote:
> > >
> > >> I spoke too soon.  get_kernel_nofault() is just fine.  You have
> > >> inlined something like __get_kernel_nofault(), which is incorrect.
> > >> get_kernel_nofault() would have done the right thing.
> > >
> > > Correct, the pagefault_disable() is critical.
>
> What specifically are you referring to?
> As far as I can see the current->pagefault_disabled is an artifact of the past.
> It doesn't provide any additional information to the fault handler beyond
> what extable already does. Consider:
>
> current->pagefault_disabled++
> some C code
> asm ("load") // with extable annotation
> some other C code
> current->pagefault_disabled--

pagefault_disabled is not about providing information to the fault
handler.  It's about changing the semantics of a fault when accessing
a user pointer.  There are two choices of semantics:

1. normal (faulthandler_disabled() returns false): accesses to user
memory can sleep, and accesses to valid VMAs are resolved.  exception
handlers for valid memory accesses are not called if the fault is
resolved.

2. faulthandler_disabled() returns true: accesses to user memory
cannot sleep.  If actual processing would be needed to resolve the
fault, it's an error instead.

This is used for two purposes: optimistic locking and best-effort
tracing.  There are code paths that might prefer to access user memory
with locks held.  They can pagefault_disable(), try the access, and
take the slow path involving dropping locks if the first try fails.
And tracing (e.g. perf call stacks) might need to access user memory
in a context in which they cannot sleep.  They can disable pagefaults
and try.  If they fail, then the user gets an incomplete call stack.

This is specifically about user memory access.  Sure, bpf could do
pagefault_disable(), but that makes no sense -- bpf is trying to do a
*kernel* access and is not validating the pointer.  On x86, which is a
poor architecture in this regard, the memory access instructions don't
adequately distinguish between user and kernel access, and bounds
checks are necessary.

>
> If there is fault in the C code the fault handler will do the wrong thing,
> since the fault is technically disabled only for asm("load").
> The handler will go into bad_area_nosemaphore() instead of find_vma().
>
> The load instruction is annotated in extable.
> The fault handler instead of:
>   if (faulthandler_disabled) search_exception_tables()
> could do:
>  search_exception_tables()
> right away without sacrificing anything.

This would sacrifice correctness in the vastly more common case of
get_user() / put_user() / copy_to/from_user(), which wants to sleep,
not return -FAULT.

> If the fault is on one of the special asm("load") the intent of the code
> is obvious. This is non faulting load that should be fixed up.
> Of course the search of extable should happen before taking mmap lock.
>
> imo pagefault_disabled can be removed.
>
> > Also the _allowed() part. The bounds check is required.
>
> Up-thread I was saying that JIT is inlining copy_from_kernel_nofault().
> That's not quite correct. When the code was written it was inlining
> bpf_probe_read(). Back then _kernel vs _user distinction didn't exist.
> So the bounds check wasn't there. The verifier side was designed
> for kernel pointers and NULL from the beginning. No user pointer
> (aside from NULL) would ever go through this path.
> Today I agree that the range check is necessary.
> The question is where to do this check.
> I see two options:
> - the JIT can emit it
> - fault handler can do it, since %rip clearly says that it's JITed asm load.

The fault handler *can't* emit it correctly because the zero page
might actually be mapped.

Frankly, inlining this at all is a bit bogus, because some day we're
going to improve the range check to check whether the pointer points
to actual memory, and BPF should really do the same thing.  This will
be too big to inline nicely.

> The intent of the code is not ambiguous.
> The intent of the fault is not ambiguous either.
> More so the mmap lock and find_vma should not be done.
>
> I prefer the later approach which can be implemented as:
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index f1f1b5a0956a..7846a95436c1 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1248,6 +1248,12 @@ void do_user_addr_fault(struct pt_regs *regs,
>         if (unlikely(kprobe_page_fault(regs, X86_TRAP_PF)))
>                 return;
>
> +       if (!address && (e = search_bpf_extables(regs->ip))) {

That's quite the set of cache lines that ought not to be touched in
the normal fault path.