Userland library functions such as allocators and threading implementations often require regions of memory to act as 'guard pages' - mappings which, when accessed, result in a fatal signal being sent to the accessing process. The current means by which these are implemented is via a PROT_NONE mmap() mapping, which provides the required semantics however incur an overhead of a VMA for each such region. With a great many processes and threads, this can rapidly add up and incur a significant memory penalty. It also has the added problem of preventing merges that might otherwise be permitted. This series takes a different approach - an idea suggested by Vlasimil Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the provenance becomes a little tricky to ascertain after this - please forgive any omissions!) - rather than locating the guard pages at the VMA layer, instead placing them in page tables mapping the required ranges. Early testing of the prototype version of this code suggests a 5 times speed up in memory mapping invocations (in conjunction with use of process_madvise()) and a 13% reduction in VMAs on an entirely idle android system and unoptimised code. We expect with optimisation and a loaded system with a larger number of guard pages this could significantly increase, but in any case these numbers are encouraging. This way, rather than having separate VMAs specifying which parts of a range are guard pages, instead we have a VMA spanning the entire range of memory a user is permitted to access and including ranges which are to be 'guarded'. After mapping this, a user can specify which parts of the range should result in a fatal signal when accessed. By restricting the ability to specify guard pages to memory mapped by existing VMAs, we can rely on the mappings being torn down when the mappings are ultimately unmapped and everything works simply as if the memory were not faulted in, from the point of view of the containing VMAs. This mechanism in effect poisons memory ranges similar to hardware memory poisoning, only it is an entirely software-controlled form of poisoning. The mechanism is implemented via madvise() behaviour - MADV_GUARD_INSTALL which installs page table-level guard page markers - and MADV_GUARD_REMOVE - which clears them. Guard markers can be installed across multiple VMAs and any existing mappings will be cleared, that is zapped, before installing the guard page markers in the page tables. There is no concept of 'nested' guard markers, multiple attempts to install guard markers in a range will, after the first attempt, have no effect. Importantly, removing guard markers over a range that contains both guard markers and ordinary backed memory has no effect on anything but the guard markers (including leaving huge pages un-split), so a user can safely remove guard markers over a range of memory leaving the rest intact. The actual mechanism by which the page table entries are specified makes use of existing logic - PTE markers, which are used for the userfaultfd UFFDIO_POISON mechanism. Unfortunately PTE_MARKER_POISONED is not suited for the guard page mechanism as it results in VM_FAULT_HWPOISON semantics in the fault handler, so we add our own specific PTE_MARKER_GUARD and adapt existing logic to handle it. We also extend the generic page walk mechanism to allow for installation of PTEs (carefully restricted to memory management logic only to prevent unwanted abuse). We ensure that zapping performed by MADV_DONTNEED and MADV_FREE do not remove guard markers, nor does forking (except when VM_WIPEONFORK is specified for a VMA which implies a total removal of memory characteristics). It's important to note that the guard page implementation is emphatically NOT a security feature, so a user can remove the markers if they wish. We simply implement it in such a way as to provide the least surprising behaviour. An extensive set of self-tests are provided which ensure behaviour is as expected and additionally self-documents expected behaviour of guard ranges. Suggested-by: Vlastimil Babka <vbabka@xxxxxxx> Suggested-by: Jann Horn <jannh@xxxxxxxxxx> Suggested-by: David Hildenbrand <david@xxxxxxxxxx> v3 * Cleaned up mm/pagewalk.c logic a bit to make things clearer, as suggested by Vlastiml. * Explicitly avoid splitting THP on PTE installation, as suggested by Vlastimil. Note this has no impact on the guard pages logic, which has page table entry handlers at PUD, PMD and PTE level. * Added WARN_ON_ONCE() to mm/hugetlb.c path where we don't expect a guard marker, as suggested by Vlastimil. * Reverted change to is_poisoned_swp_entry() to exclude guard pages which has the effect of MADV_FREE _not_ clearing guard pages. After discussion with Vlastimil, it became apparent that the ability to 'cancel' the freeing operation by writing to the mapping after having issued an MADV_FREE would mean that we would risk unexpected behaviour should the guard pages be removed, so we now do not remove markers here at all. * Added comment to PTE_MARKER_GUARD to highlight that memory tagged with the marker behaves as if it were a region mapped PROT_NONE, as highlighted by David. * Rename poison -> install, unpoison -> remove (i.e. MADV_GUARD_INSTALL / MADV_GUARD_REMOVE over MADV_GUARD_POISON / MADV_GUARD_REMOVE) at the request of David and John who both find the poison analogy confusing/overloaded. * After a lot of discussion, replace the looping behaviour should page faults race with guard page installation with a modest reattempt followed by returning -ERESTARTNOINTR to have the operation abort and re-enter, relieving lock contention and avoiding the possibility of allowing a malicious sandboxed process to impact the mmap lock or stall the overall process more than necessary, as suggested by Jann and Vlastimil having raised the issue. * Adjusted the page table walker so a populated huge PUD or PMD is correctly treated as being populated, necessitating a zap. In v2 we incorrectly skipped over these, which would cause the logic to wrongly proceed as if nothing were populated and the install succeeded. Instead, explicitly check to see if a huge page - if so, do not split but rather abort the operation and let zap take care of things. * Updated the guard remove logic to not unnecessarily split huge pages either. * Added a debug check to assert that the number of installed PTEs matches expectation, accounting for any existing guard pages. * Adapted vector_madvise() used by the process_madvise() system call to handle -ERESTARTNOINTR correctly. v2 * The macros in kselftest_harness.h seem to be broken - __EXPECT() is terminated by '} while (0); OPTIONAL_HANDLER(_assert)' meaning it is not safe in single line if / else or for /which blocks, however working around this results in checkpatch producing invalid warnings, as reported by Shuah. * Fixing these macros is out of scope for this series, so compromise and instead rewrite test blocks so as to use multiple lines by separating out a decl in most cases. This has the side effect of, for the most part, making things more readable. * Heavily document the use of the volatile keyword - we can't avoid checkpatch complaining about this, so we explain it, as reported by Shuah. * Updated commit message to highlight that we skip tests we lack permissions for, as reported by Shuah. * Replaced a perror() with ksft_exit_fail_perror(), as reported by Shuah. * Added user friendly messages to cases where tests are skipped due to lack of permissions, as reported by Shuah. * Update the tool header to include the new MADV_GUARD_POISON/UNPOISON defines and directly include asm-generic/mman.h to get the platform-neutral versions to ensure we import them. * Finally fixed Vlastimil's email address in Suggested-by tags from suze to suse, as reported by Vlastimil. * Added linux-api to cc list, as reported by Vlastimil. https://lore.kernel.org/all/cover.1729440856.git.lorenzo.stoakes@xxxxxxxxxx/ v1 * Un-RFC'd as appears no major objections to approach but rather debate on implementation. * Fixed issue with arches which need mmu_context.h and tlbfush.h. header imports in pagewalker logic to be able to use update_mmu_cache() as reported by the kernel test bot. * Added comments in page walker logic to clarify who can use ops->install_pte and why as well as adding a check_ops_valid() helper function, as suggested by Christoph. * Pass false in full parameter in pte_clear_not_present_full() as suggested by Jann. * Stopped erroneously requiring a write lock for the poison operation as suggested by Jann and Suren. * Moved anon_vma_prepare() to the start of madvise_guard_poison() to be consistent with how this is used elsewhere in the kernel as suggested by Jann. * Avoid returning -EAGAIN if we are raced on page faults, just keep looping and duck out if a fatal signal is pending or a conditional reschedule is needed, as suggested by Jann. * Avoid needlessly splitting huge PUDs and PMDs by specifying ACTION_CONTINUE, as suggested by Jann. https://lore.kernel.org/all/cover.1729196871.git.lorenzo.stoakes@xxxxxxxxxx/ RFC https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@xxxxxxxxxx/ Lorenzo Stoakes (5): mm: pagewalk: add the ability to install PTEs mm: add PTE_MARKER_GUARD PTE marker mm: madvise: implement lightweight guard page mechanism tools: testing: update tools UAPI header for mman-common.h selftests/mm: add self tests for guard page feature arch/alpha/include/uapi/asm/mman.h | 3 + arch/mips/include/uapi/asm/mman.h | 3 + arch/parisc/include/uapi/asm/mman.h | 3 + arch/xtensa/include/uapi/asm/mman.h | 3 + include/linux/mm_inline.h | 2 +- include/linux/pagewalk.h | 18 +- include/linux/swapops.h | 24 +- include/uapi/asm-generic/mman-common.h | 3 + mm/hugetlb.c | 4 + mm/internal.h | 12 + mm/madvise.c | 225 ++++ mm/memory.c | 18 +- mm/mprotect.c | 6 +- mm/mseal.c | 1 + mm/pagewalk.c | 227 +++- tools/include/uapi/asm-generic/mman-common.h | 3 + tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/guard-pages.c | 1239 ++++++++++++++++++ 19 files changed, 1720 insertions(+), 76 deletions(-) create mode 100644 tools/testing/selftests/mm/guard-pages.c -- 2.47.0