On Mon, 4 Nov 2024 at 11:16, Dave Hansen <dave.hansen@xxxxxxxxx> wrote: > > On 11/3/24 11:35, Kumar Kartikeya Dwivedi wrote: > > Currently, on x86, when SMAP is enabled, and a page fault occurs in > > kernel mode for accessing a user address, the kernel will rightly panic > > as no valid kernel code can cause such a page fault (unless buggy). > > There is no valid correct kernel code that can generate such a fault, > > therefore this behavior would be correct. > > > > BPF programs that currently encounter user addresses when doing > > PROBE_MEM loads (load instructions which are allowed to read any kernel > > address, only available for root users) avoid a page fault by performing > > bounds checking on the address. This requires the JIT to emit a jump > > over each PROBE_MEM load instruction to avoid hitting page faults. > > To be honest, I think the overhead (and complexity) is in the right spot > today: the BPF program. I don't want to complicate the x86 page fault > handler for a debugging feature that already has a functional solution > today. This is not a debugging feature in BPF programs. Almost all tracing BPF programs that potentially execute inside the kernel contain loads which have to be marked PROBE_MEM because potentially, they may read a user pointer and dereference it. The BPF runtime relies on PROBE_MEM whenever reading from pointers (for programs having root) that cannot be trusted to be safe statically. While reading invalid memory is a rare case that should not happen, it may be possible if some kernel field contains a stale address etc. so loads into such pointers are allowed statically, and handled/trapped at runtime to zero out the destination register. So far x86 won't invoke fixup_exception for user faults, as it wasn't supposed to happen, which is correct behavior (hence the bounds check). However, this cost is paid to handle a rare case in the common path of BPF programs. It would be nice to avoid it, by deferring to fixup_exception->ex_handler_bpf in case the pc belongs to BPF text. This was the least intrusive way I could think of doing it. On arm64, the fault handler would do exception handling in such a case. So with this patch we can think of removing the bounds checking across architectures. We can do something similar for RISC-V as well. I am fine with changing the commit log/title to reflect that it's an added cost for the fault handler. I felt that it wouldn't matter because the kernel was about to panic, so we had bailed anyway, but we _do_ end up spending extra cycles in case this path is hit now.