On Wed, Feb 01, 2023 at 02:48:22PM -0500, Peter Xu wrote:
> I do also see a common pattern of the possibility to have a generic fault
> handler like generic_page_fault().  It should probably cover everything
> from taking the mmap_sem to providing some retval that is much easier to
> digest further by the arch-dependent code, so it can directly do something
> rather than parsing the bitmask in a duplicated way (hence the new retval
> should hopefully not be a bitmask anymore but a "what to do").  Maybe it
> can be something like:
> 
> /**
>  * enum page_fault_retval - Higher level fault retval, generalized from
>  * vm_fault_reason above, which is only used by hardware page fault
>  * handlers.  It generalizes the bitmask-versioned retval into something
>  * that the arch-dependent code should react upon.
>  *
>  * @PF_RET_DONE:       The page fault is completed successfully
>  * @PF_RET_BAD_AREA:   The page fault address falls in a bad area
>  *                     (e.g., vma not found, expand_stack() fails..)
FWIW, there's a fun discrepancy here - VM_FAULT_SIGSEGV may yield either
SEGV_MAPERR or SEGV_ACCERR, depending upon the architecture.  Not that
there are many places returning VM_FAULT_SIGSEGV these days...  Good
thing, too, since otherwise e.g. csky would oops...
>  * @PF_RET_ACCESS_ERR: The page fault has access errors
>  *                     (e.g., write fault on !VM_WRITE vmas)
>  * @PF_RET_KERN_FIXUP: The page fault requires kernel fixups
>  *                     (e.g., during copy_to_user() but fault failed?)
>  * @PF_RET_HWPOISON:   The page fault encountered poisoned pages
>  * @PF_RET_SIGNAL:     The page fault encountered poisoned pages
??
>  * ...
>  */
> enum page_fault_retval {
> 	PF_RET_DONE = 0,
> 	PF_RET_BAD_AREA,
> 	PF_RET_ACCESS_ERR,
> 	PF_RET_KERN_FIXUP,
> 	PF_RET_HWPOISON,
> 	PF_RET_SIGNAL,
> 	...
> };
> 
> As a start we may still want to return some more information (perhaps
> still the vm_fault_t alongside?  Or another union that will provide
> different information based on different PF_RET_*).  One major thing is
> how we handle VM_FAULT_HWPOISON, and also the fact that we encode
> something more into the bitmask for page sizes (VM_FAULT_HINDEX_MASK).
> 
> So the generic helper could, hopefully, hide the complexity of:
> 
>   - Taking and releasing of the mmap lock
>   - find_vma(), and also the relevant checks on access or stack handling
Umm...  arm is a bit special here:

	if (addr < FIRST_USER_ADDRESS)
		return VM_FAULT_BADMAP;

with no counterparts elsewhere.
>   - handle_mm_fault() itself (of course...)
>   - detect signals
>   - handle page fault retries (so, in the new layer of retval there
>     should be nothing telling the arch to retry; it should always be
>     the ultimate result)
Agreed.  And one more for the list:

  - unlock mmap; don't leave that to the caller.
>   - parse the different errors into "what the arch code should do", and
>     generalize the common ones, e.g.:
> 
>       - OOM: do pagefault_out_of_memory() for user-mode
>       - VM_FAULT_SIGSEGV: should be able to merge into PF_RET_BAD_AREA?
>       - ...
AFAICS, all errors in kernel mode => no_context.
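FWIW, to make the shape concrete - a rough sketch of how such a helper
could hang together under the proposed enum.  All of it is hypothetical
(the killable-lock and trylock dances mentioned further down are waved
away, and the flags plumbing is simplified):

	static enum page_fault_retval
	generic_page_fault(struct pt_regs *regs, unsigned long address,
			   unsigned long vm_flags, unsigned int flags)
	{
		struct mm_struct *mm = current->mm;
		struct vm_area_struct *vma;
		vm_fault_t fault;

		mmap_read_lock(mm);
	retry:
		vma = find_vma(mm, address);
		if (!vma)
			goto bad_area;
		if (unlikely(vma->vm_start > address)) {
			if (!(vma->vm_flags & VM_GROWSDOWN) ||
			    expand_stack(vma, address))
				goto bad_area;
		}
		/* arm-style access check: non-empty intersection wins */
		if (!(vma->vm_flags & vm_flags)) {
			mmap_read_unlock(mm);
			return PF_RET_ACCESS_ERR;
		}

		fault = handle_mm_fault(vma, address, flags, regs);

		/* lock is already dropped when a pending signal aborts
		 * the fault (VM_FAULT_RETRY case) */
		if (fault_signal_pending(fault, regs))
			return user_mode(regs) ? PF_RET_SIGNAL
					       : PF_RET_KERN_FIXUP;
		/* VM_FAULT_COMPLETED also means the lock is gone */
		if (fault & VM_FAULT_COMPLETED)
			return PF_RET_DONE;
		if (fault & VM_FAULT_RETRY) {
			flags |= FAULT_FLAG_TRIED;
			mmap_read_lock(mm);
			goto retry;
		}
		mmap_read_unlock(mm);

		if (likely(!(fault & VM_FAULT_ERROR)))
			return PF_RET_DONE;
		if (fault & VM_FAULT_OOM) {
			if (!user_mode(regs))
				return PF_RET_KERN_FIXUP;
			pagefault_out_of_memory();
			return PF_RET_DONE;
		}
		if (fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
			return PF_RET_HWPOISON;
		return PF_RET_BAD_AREA;	/* VM_FAULT_SIGSEGV et al. */

	bad_area:
		mmap_read_unlock(mm);
		return PF_RET_BAD_AREA;
	}

The arch side would then collapse into a switch on the result;
incidentally, that would settle the SEGV_MAPERR vs SEGV_ACCERR
discrepancy above once and for all:

	switch (generic_page_fault(regs, address, vm_flags, flags)) {
	case PF_RET_DONE:
	case PF_RET_SIGNAL:	/* delivered on the way out to userland */
		return;
	case PF_RET_BAD_AREA:
		force_sig_fault(SIGSEGV, SEGV_MAPERR, (void __user *)address);
		return;
	case PF_RET_ACCESS_ERR:
		force_sig_fault(SIGSEGV, SEGV_ACCERR, (void __user *)address);
		return;
	case PF_RET_HWPOISON:
		/* force_sig_mceerr(), lsb from the fault info */
		...
	case PF_RET_KERN_FIXUP:
		/* search exception tables, oops if nothing's there */
		...
	}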
> It'll simplify things if we can unify some of the small details, like
> whether the -EFAULT above should come with a sigbus.
> 
> A trivial detail I found when looking at this: x86_64 passes different
> signals to kernelmode_fixup_or_oops() - there are three call sites in
> do_user_addr_fault() and each of them passes a different signal.  IIUC
> that will only make a difference if there's a nested page fault during
> vsyscall emulation (but I may be wrong, since I'm new to this code),
> and I have no idea when that can happen and whether it needs to be
> strictly followed.
From my (very incomplete so far) dig through that pile:
Q: do we still have cases where handle_mm_fault() does not return any of
VM_FAULT_COMPLETED | VM_FAULT_RETRY | VM_FAULT_ERROR?  That gets treated
as unlock + VM_FAULT_COMPLETED, but do we still need that?

Q: can VM_FAULT_RETRY be mixed with anything in VM_FAULT_ERROR?  What's
the locking situation if that happens?

* the details of storing the fault information (for ptrace, mostly) vary
a lot; no chance to unify, AFAICS.

* requirements for vma flags also differ; e.g. a read fault on alpha is
explicitly OK with the absence of VM_READ if VM_WRITE is there.  Probably
we should go the arm way and pass a mask that must have a non-empty
intersection with vma->vm_flags (see the sketch below)?  Because *that*
is very likely to be a part of the ABI - mmap(2) callers that rely upon
the flags being OK for a given architecture are quite possible.

* the mmap lock is also quite variable in how it's taken; x86 and arm
have a fun dance with trylock/search for exception handler/etc.  Other
architectures do not; OTOH, there's a prefetch stuck into the itanic
variant, with a comment about mmap_sem being performance-critical...

* the logic for stack expansion includes this twist:

	if (!(vma->vm_flags & VM_GROWSDOWN))
		goto map_err;
	if (user_mode(regs)) {
		/* Accessing the stack below usp is always a bug.  The
		   "+ 256" is there due to some instructions doing
		   pre-decrement on the stack and that doesn't show up
		   until later.  */
		if (address + 256 < rdusp())
			goto map_err;
	}
	if (expand_stack(vma, address))
		goto map_err;

That's m68k; ISTR similar considerations elsewhere, but I could be wrong.
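To spell out the mask idea - the check itself stays generic and each
architecture encodes its ABI quirks purely in the mask it passes down.
A sketch (the per-fault-type mask choices are illustrative, not lifted
from the actual trees):

	/* generic side, as in the helper sketch above */
	if (!(vma->vm_flags & vm_flags))
		return PF_RET_ACCESS_ERR;

	/* per-arch callers just pick the mask: */

	/* write fault, pretty much anywhere */
	vm_flags = VM_WRITE;

	/* instruction fetch */
	vm_flags = VM_EXEC;

	/* read fault on alpha: VM_WRITE alone is enough, per its ABI */
	vm_flags = VM_READ | VM_WRITE;

That way the alpha quirk becomes one line in its own fault handler
instead of a special case in the generic code.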