On Fri, Jan 29, 2021 at 08:00:57AM -0800, Andy Lutomirski wrote: > > > On Jan 29, 2021, at 1:21 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: > > > > On Thu, Jan 28, 2021 at 04:32:34PM -0800, Andy Lutomirski wrote: > > > >> I spoke too soon. get_kernel_nofault() is just fine. You have > >> inlined something like __get_kernel_nofault(), which is incorrect. > >> get_kernel_nofault() would have done the right thing. > > > > Correct, the pagefault_disable() is critical. What specifically are you referring to? As far as I can see the current->pagefault_disabled is an artifact of the past. It doesn't provide any additional information to the fault handler beyond what extable already does. Consider: current->pagefault_disabled++ some C code asm ("load") // with extable annotation some other C code current->pagefault_disabled-- If there is fault in the C code the fault handler will do the wrong thing, since the fault is technically disabled only for asm("load"). The handler will go into bad_area_nosemaphore() instead of find_vma(). The load instruction is annotated in extable. The fault handler instead of: if (faulthandler_disabled) search_exception_tables() could do: search_exception_tables() right away without sacrificing anything. If the fault is on one of the special asm("load") the intent of the code is obvious. This is non faulting load that should be fixed up. Of course the search of extable should happen before taking mmap lock. imo pagefault_disabled can be removed. > Also the _allowed() part. The bounds check is required. Up-thread I was saying that JIT is inlining copy_from_kernel_nofault(). That's not quite correct. When the code was written it was inlining bpf_probe_read(). Back then _kernel vs _user distinction didn't exist. So the bounds check wasn't there. The verifier side was designed for kernel pointers and NULL from the beginning. No user pointer (aside from NULL) would ever go through this path. Today I agree that the range check is necessary. The question is where to do this check. I see two options: - the JIT can emit it - fault handler can do it, since %rip clearly says that it's JITed asm load. The intent of the code is not ambiguous. The intent of the fault is not ambiguous either. More so the mmap lock and find_vma should not be done. I prefer the later approach which can be implemented as: diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index f1f1b5a0956a..7846a95436c1 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1248,6 +1248,12 @@ void do_user_addr_fault(struct pt_regs *regs, if (unlikely(kprobe_page_fault(regs, X86_TRAP_PF))) return; + if (!address && (e = search_bpf_extables(regs->ip))) { + handler = ex_fixup_handler(e); + handler(e, regs, trapnr, error_code, fault_addr); + return; + } + Comparing to former option it saves bpf prog cycles. Consider that users do not write a->b->c just to walk NULL pointers. The bpf program authors are not writing junk programs. If the program is walking the pointers it expects to read useful data. If one of those pointers can be NULL the program author will add if (!ptr) check there. The author will use blind a->b->c only for cases where these pointers are most likely will be !NULL. "Most likely" means that in 99.9% of the situations there will be no faults. When people write bpf tracing progs they care about the perf overhead. Obviously if prog is causing faults somebody will notice and will slap that bpf author for causing a performance regression. So faulting on NULL is very rare in practice. There is a selftests/bpf for faulting on NULL, but it's a test only. If the NULL check is added by JIT it will not be hit 99.9% of the time. It will be a pure overhead for program execution. Hence my preference for the latter approach.