On Mon, Oct 28, 2024 at 02:13:26PM +0000, Lorenzo Stoakes wrote: > Userland library functions such as allocators and threading implementations > often require regions of memory to act as 'guard pages' - mappings which, > when accessed, result in a fatal signal being sent to the accessing > process. > > The current means by which these are implemented is via a PROT_NONE mmap() > mapping, which provides the required semantics however incur an overhead of > a VMA for each such region. > > With a great many processes and threads, this can rapidly add up and incur > a significant memory penalty. It also has the added problem of preventing > merges that might otherwise be permitted. > > This series takes a different approach - an idea suggested by Vlasimil > Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the > provenance becomes a little tricky to ascertain after this - please forgive > any omissions!) - rather than locating the guard pages at the VMA layer, > instead placing them in page tables mapping the required ranges. > > Early testing of the prototype version of this code suggests a 5 times > speed up in memory mapping invocations (in conjunction with use of > process_madvise()) and a 13% reduction in VMAs on an entirely idle android > system and unoptimised code. > > We expect with optimisation and a loaded system with a larger number of > guard pages this could significantly increase, but in any case these > numbers are encouraging. > > This way, rather than having separate VMAs specifying which parts of a > range are guard pages, instead we have a VMA spanning the entire range of > memory a user is permitted to access and including ranges which are to be > 'guarded'. > > After mapping this, a user can specify which parts of the range should > result in a fatal signal when accessed. > > By restricting the ability to specify guard pages to memory mapped by > existing VMAs, we can rely on the mappings being torn down when the > mappings are ultimately unmapped and everything works simply as if the > memory were not faulted in, from the point of view of the containing VMAs. > > This mechanism in effect poisons memory ranges similar to hardware memory > poisoning, only it is an entirely software-controlled form of poisoning. > > The mechanism is implemented via madvise() behaviour - MADV_GUARD_INSTALL > which installs page table-level guard page markers - and > MADV_GUARD_REMOVE - which clears them. > > Guard markers can be installed across multiple VMAs and any existing > mappings will be cleared, that is zapped, before installing the guard page > markers in the page tables. > > There is no concept of 'nested' guard markers, multiple attempts to install > guard markers in a range will, after the first attempt, have no effect. > > Importantly, removing guard markers over a range that contains both guard > markers and ordinary backed memory has no effect on anything but the guard > markers (including leaving huge pages un-split), so a user can safely > remove guard markers over a range of memory leaving the rest intact. > > The actual mechanism by which the page table entries are specified makes > use of existing logic - PTE markers, which are used for the userfaultfd > UFFDIO_POISON mechanism. > > Unfortunately PTE_MARKER_POISONED is not suited for the guard page > mechanism as it results in VM_FAULT_HWPOISON semantics in the fault > handler, so we add our own specific PTE_MARKER_GUARD and adapt existing > logic to handle it. > > We also extend the generic page walk mechanism to allow for installation of > PTEs (carefully restricted to memory management logic only to prevent > unwanted abuse). > > We ensure that zapping performed by MADV_DONTNEED and MADV_FREE do not > remove guard markers, nor does forking (except when VM_WIPEONFORK is > specified for a VMA which implies a total removal of memory > characteristics). > > It's important to note that the guard page implementation is emphatically > NOT a security feature, so a user can remove the markers if they wish. We > simply implement it in such a way as to provide the least surprising > behaviour. > > An extensive set of self-tests are provided which ensure behaviour is as > expected and additionally self-documents expected behaviour of guard > ranges. Dear Lorenzo, Dear colleagues, sorry about raising an old thread. It looks like this feature is now used in glibc [1]. And we noticed failures in CRIU [2] CI on Fedora Rawhide userspace. Now a question is how we can properly detect such "guarded" pages from user space. As I can see from MADV_GUARD_INSTALL implementation, it does not modify VMA flags anyhow, but only page tables. It means that /proc/<pid>/maps and /proc/<pid>/smaps interfaces are useless in this case. (Please, correct me if I'm missing anything here.) I wonder if you have any ideas / suggestions regarding Checkpoint/Restore here. We (CRIU devs) are happy to develop some patches to bring some uAPI to expose MADV_GUARDs, but before going into this we decided to raise this question in LKML. +CC criu@xxxxxxxxxxxxxxx +CC Andrei Vagin <avagin@xxxxxxxxx> +CC Pavel Tikhomirov <ptikhomirov@xxxxxxxxxxxxx> Kind regards, Alex [1] https://github.com/bminor/glibc/commit/a6fbe36b7f31292981422692236465ab56670ea9 [2] https://github.com/checkpoint-restore/criu/pull/2625 > > Suggested-by: Vlastimil Babka <vbabka@xxxxxxx> > Suggested-by: Jann Horn <jannh@xxxxxxxxxx> > Suggested-by: David Hildenbrand <david@xxxxxxxxxx> > > v4 > * Use restart_syscall() to implement -ERESTARTNOINTR to ensure correctly > handled by kernel - tested this code path and confirmed it works > correctly. Thanks to Vlastimil for pointing this issue out! > * Updated the vector_madvise() handler to not unnecessarily invoke > cond_resched() as suggested by Vlastimil. > * Updated guard page tests to add a test for a vector operation which > overwrites existing mappings. Tested this against the -ERESTARTNOINTR > case and confirmed working. > * Improved page walk logic further, refactoring handling logic as suggested > by Vlastimil. > * Moved MAX_MADVISE_GUARD_RETRIES to mm/madvise.c as suggested by Vlastimil. > > v3 > * Cleaned up mm/pagewalk.c logic a bit to make things clearer, as suggested > by Vlastiml. > * Explicitly avoid splitting THP on PTE installation, as suggested by > Vlastimil. Note this has no impact on the guard pages logic, which has > page table entry handlers at PUD, PMD and PTE level. > * Added WARN_ON_ONCE() to mm/hugetlb.c path where we don't expect a guard > marker, as suggested by Vlastimil. > * Reverted change to is_poisoned_swp_entry() to exclude guard pages which > has the effect of MADV_FREE _not_ clearing guard pages. After discussion > with Vlastimil, it became apparent that the ability to 'cancel' the > freeing operation by writing to the mapping after having issued an > MADV_FREE would mean that we would risk unexpected behaviour should the > guard pages be removed, so we now do not remove markers here at all. > * Added comment to PTE_MARKER_GUARD to highlight that memory tagged with > the marker behaves as if it were a region mapped PROT_NONE, as > highlighted by David. > * Rename poison -> install, unpoison -> remove (i.e. MADV_GUARD_INSTALL / > MADV_GUARD_REMOVE over MADV_GUARD_POISON / MADV_GUARD_REMOVE) at the > request of David and John who both find the poison analogy > confusing/overloaded. > * After a lot of discussion, replace the looping behaviour should page > faults race with guard page installation with a modest reattempt followed > by returning -ERESTARTNOINTR to have the operation abort and re-enter, > relieving lock contention and avoiding the possibility of allowing a > malicious sandboxed process to impact the mmap lock or stall the overall > process more than necessary, as suggested by Jann and Vlastimil having > raised the issue. > * Adjusted the page table walker so a populated huge PUD or PMD is > correctly treated as being populated, necessitating a zap. In v2 we > incorrectly skipped over these, which would cause the logic to wrongly > proceed as if nothing were populated and the install succeeded. > Instead, explicitly check to see if a huge page - if so, do not split but > rather abort the operation and let zap take care of things. > * Updated the guard remove logic to not unnecessarily split huge pages > either. > * Added a debug check to assert that the number of installed PTEs matches > expectation, accounting for any existing guard pages. > * Adapted vector_madvise() used by the process_madvise() system call to > handle -ERESTARTNOINTR correctly. > https://lore.kernel.org/all/cover.1729699916.git.lorenzo.stoakes@xxxxxxxxxx/ > > v2 > * The macros in kselftest_harness.h seem to be broken - __EXPECT() is > terminated by '} while (0); OPTIONAL_HANDLER(_assert)' meaning it is not > safe in single line if / else or for /which blocks, however working > around this results in checkpatch producing invalid warnings, as reported > by Shuah. > * Fixing these macros is out of scope for this series, so compromise and > instead rewrite test blocks so as to use multiple lines by separating out > a decl in most cases. This has the side effect of, for the most part, > making things more readable. > * Heavily document the use of the volatile keyword - we can't avoid > checkpatch complaining about this, so we explain it, as reported by > Shuah. > * Updated commit message to highlight that we skip tests we lack > permissions for, as reported by Shuah. > * Replaced a perror() with ksft_exit_fail_perror(), as reported by Shuah. > * Added user friendly messages to cases where tests are skipped due to lack > of permissions, as reported by Shuah. > * Update the tool header to include the new MADV_GUARD_POISON/UNPOISON > defines and directly include asm-generic/mman.h to get the > platform-neutral versions to ensure we import them. > * Finally fixed Vlastimil's email address in Suggested-by tags from suze to > suse, as reported by Vlastimil. > * Added linux-api to cc list, as reported by Vlastimil. > https://lore.kernel.org/all/cover.1729440856.git.lorenzo.stoakes@xxxxxxxxxx/ > > v1 > * Un-RFC'd as appears no major objections to approach but rather debate on > implementation. > * Fixed issue with arches which need mmu_context.h and > tlbfush.h. header imports in pagewalker logic to be able to use > update_mmu_cache() as reported by the kernel test bot. > * Added comments in page walker logic to clarify who can use > ops->install_pte and why as well as adding a check_ops_valid() helper > function, as suggested by Christoph. > * Pass false in full parameter in pte_clear_not_present_full() as suggested > by Jann. > * Stopped erroneously requiring a write lock for the poison operation as > suggested by Jann and Suren. > * Moved anon_vma_prepare() to the start of madvise_guard_poison() to be > consistent with how this is used elsewhere in the kernel as suggested by > Jann. > * Avoid returning -EAGAIN if we are raced on page faults, just keep looping > and duck out if a fatal signal is pending or a conditional reschedule is > needed, as suggested by Jann. > * Avoid needlessly splitting huge PUDs and PMDs by specifying > ACTION_CONTINUE, as suggested by Jann. > https://lore.kernel.org/all/cover.1729196871.git.lorenzo.stoakes@xxxxxxxxxx/ > > RFC > https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@xxxxxxxxxx/ > > Lorenzo Stoakes (5): > mm: pagewalk: add the ability to install PTEs > mm: add PTE_MARKER_GUARD PTE marker > mm: madvise: implement lightweight guard page mechanism > tools: testing: update tools UAPI header for mman-common.h > selftests/mm: add self tests for guard page feature > > arch/alpha/include/uapi/asm/mman.h | 3 + > arch/mips/include/uapi/asm/mman.h | 3 + > arch/parisc/include/uapi/asm/mman.h | 3 + > arch/xtensa/include/uapi/asm/mman.h | 3 + > include/linux/mm_inline.h | 2 +- > include/linux/pagewalk.h | 18 +- > include/linux/swapops.h | 24 +- > include/uapi/asm-generic/mman-common.h | 3 + > mm/hugetlb.c | 4 + > mm/internal.h | 6 + > mm/madvise.c | 239 ++++ > mm/memory.c | 18 +- > mm/mprotect.c | 6 +- > mm/mseal.c | 1 + > mm/pagewalk.c | 246 +++- > tools/include/uapi/asm-generic/mman-common.h | 3 + > tools/testing/selftests/mm/.gitignore | 1 + > tools/testing/selftests/mm/Makefile | 1 + > tools/testing/selftests/mm/guard-pages.c | 1243 ++++++++++++++++++ > 19 files changed, 1751 insertions(+), 76 deletions(-) > create mode 100644 tools/testing/selftests/mm/guard-pages.c > > -- > 2.47.0