+CC linux-api (it should also be CC'd on future revisions)

On 10/17/24 22:42, Lorenzo Stoakes wrote:
> Userland library functions such as allocators and threading implementations
> often require regions of memory to act as 'guard pages' - mappings which,
> when accessed, result in a fatal signal being sent to the accessing
> process.
>
> The current means by which these are implemented is via a PROT_NONE mmap()
> mapping, which provides the required semantics but incurs an overhead of
> a VMA for each such region.
>
> With a great many processes and threads, this can rapidly add up and incur
> a significant memory penalty. It also has the added problem of preventing
> merges that might otherwise be permitted.
>
> This series takes a different approach - an idea suggested by Vlastimil
> Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
> provenance becomes a little tricky to ascertain after this - please forgive
> any omissions!) - rather than locating the guard pages at the VMA layer,
> instead placing them in page tables mapping the required ranges.
>
> Early testing of the prototype version of this code suggests a 5 times
> speed up in memory mapping invocations (in conjunction with use of
> process_madvise()) and a 13% reduction in VMAs on an entirely idle Android
> system and unoptimised code.
>
> We expect with optimisation and a loaded system with a larger number of
> guard pages this could significantly increase, but in any case these
> numbers are encouraging.
>
> This way, rather than having separate VMAs specifying which parts of a
> range are guard pages, instead we have a VMA spanning the entire range of
> memory a user is permitted to access and including ranges which are to be
> 'guarded'.
>
> After mapping this, a user can specify which parts of the range should
> result in a fatal signal when accessed.
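
For illustration, the existing PROT_NONE approach described in the cover letter can be sketched roughly as below. This is a minimal sketch and not code from the series: the helper name map_with_guard() and its page-count parameter are hypothetical, and it assumes a Linux system with anonymous private mappings. The point is only that the inaccessible page is a separate mapping, so it costs a VMA of its own.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `usable_pages` pages of read/write anonymous
 * memory followed by one PROT_NONE guard page.
 * Returns the base address, or NULL on failure.
 */
void *map_with_guard(size_t usable_pages)
{
	long page = sysconf(_SC_PAGESIZE);

	assert(page > 0 && usable_pages > 0);

	size_t usable = usable_pages * (size_t)page;
	char *base = mmap(NULL, usable + (size_t)page, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return NULL;

	/* Overlay the final page with an inaccessible mapping: a second VMA. */
	void *guard = mmap(base + usable, (size_t)page, PROT_NONE,
			   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	if (guard == MAP_FAILED) {
		munmap(base, usable + (size_t)page);
		return NULL;
	}
	return base;
}
```

Any access to the final page of the returned region then raises SIGSEGV, while the pages before it remain usable. Each guard set up this way adds one VMA, which is the overhead the series aims to remove.
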
>
> By restricting the ability to specify guard pages to memory mapped by
> existing VMAs, we can rely on the mappings being torn down when the
> mappings are ultimately unmapped and everything works simply as if the
> memory were not faulted in, from the point of view of the containing VMAs.
>
> This mechanism in effect poisons memory ranges similarly to hardware memory
> poisoning, only it is an entirely software-controlled form of poisoning.
>
> Any poisoned region of memory is also able to be 'unpoisoned', that is, to
> have its poison markers removed.
>
> The mechanism is implemented via madvise() behaviour - MADV_GUARD_POISON,
> which simply poisons ranges, and MADV_GUARD_UNPOISON, which clears this
> poisoning.
>
> Poisoning can be performed across multiple VMAs and any existing mappings
> will be cleared, that is zapped, before installing the poisoned page table
> mappings.
>
> There is no concept of 'nested' poisoning; multiple attempts to poison a
> range will, after the first poisoning, have no effect.
>
> Importantly, unpoisoning of poisoned ranges has no effect on non-poisoned
> memory, so a user can safely unpoison a range of memory and clear only
> poison page table mappings, leaving the rest intact.
>
> The actual mechanism by which the page table entries are specified makes
> use of existing logic - PTE markers, which are used for the userfaultfd
> UFFDIO_POISON mechanism.
>
> Unfortunately PTE_MARKER_POISONED is not suited for the guard page
> mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
> handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
> logic to handle it.
>
> We also extend the generic page walk mechanism to allow for installation of
> PTEs (carefully restricted to memory management logic only to prevent
> unwanted abuse).
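
From userspace, the madvise() interface described above might be used along the following lines. This is a sketch under stated assumptions rather than code from the series: the fallback numeric values for MADV_GUARD_POISON and MADV_GUARD_UNPOISON are assumptions and should be taken from the uapi headers of a kernel carrying the patches, and the helper names are hypothetical. On a kernel without the feature, madvise() would be expected to fail, typically with EINVAL.

```c
#define _DEFAULT_SOURCE /* for madvise() under strict -std= modes */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* ASSUMPTION: fallback values; use the real uapi headers where available. */
#ifndef MADV_GUARD_POISON
#define MADV_GUARD_POISON 102
#endif
#ifndef MADV_GUARD_UNPOISON
#define MADV_GUARD_UNPOISON 103
#endif

/* Hypothetical wrappers: return 0 on success or a negative errno value. */
int guard_poison(void *addr, size_t len)
{
	return madvise(addr, len, MADV_GUARD_POISON) == 0 ? 0 : -errno;
}

int guard_unpoison(void *addr, size_t len)
{
	return madvise(addr, len, MADV_GUARD_UNPOISON) == 0 ? 0 : -errno;
}

/* Poison only the last page of an existing, page-aligned mapping. */
int guard_last_page(void *base, size_t len)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);

	assert(page > 0 && len >= page && len % page == 0);
	return guard_poison((char *)base + len - page, page);
}
```

A caller would map one region covering both usable and guarded memory, call guard_last_page() on it, and later call guard_unpoison() over the whole region. Per the description above, the latter clears only the poison markers and leaves ordinary pages intact, so no second VMA is needed for the guard.
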
>
> We ensure that zapping performed by, for instance, MADV_DONTNEED, does not
> remove guard poison markers, nor does forking (except when VM_WIPEONFORK is
> specified for a VMA, which implies a total removal of memory
> characteristics).
>
> It's important to note that the guard page implementation is emphatically
> NOT a security feature, so a user can remove the poisoning if they wish. We
> simply implement it in such a way as to provide the least surprising
> behaviour.
>
> An extensive set of self-tests is provided which ensures behaviour is as
> expected and additionally self-documents expected behaviour of poisoned
> ranges.
>
> Suggested-by: Vlastimil Babka <vbabka@xxxxxxx>

Please fix the domain typo (also in patch 3 :)

Thanks for implementing this,
Vlastimil

> Suggested-by: Jann Horn <jannh@xxxxxxxxxx>
> Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
>
> v1
> * Un-RFC'd as there appear to be no major objections to the approach but
>   rather debate on implementation.
> * Fixed issue with arches which need mmu_context.h and tlbflush.h header
>   imports in pagewalker logic to be able to use update_mmu_cache(), as
>   reported by the kernel test bot.
> * Added comments in page walker logic to clarify who can use
>   ops->install_pte and why as well as adding a check_ops_valid() helper
>   function, as suggested by Christoph.
> * Pass false in full parameter in pte_clear_not_present_full() as suggested
>   by Jann.
> * Stopped erroneously requiring a write lock for the poison operation as
>   suggested by Jann and Suren.
> * Moved anon_vma_prepare() to the start of madvise_guard_poison() to be
>   consistent with how this is used elsewhere in the kernel as suggested by
>   Jann.
> * Avoid returning -EAGAIN if we are raced on page faults, just keep looping
>   and duck out if a fatal signal is pending or a conditional reschedule is
>   needed, as suggested by Jann.
> * Avoid needlessly splitting huge PUDs and PMDs by specifying
>   ACTION_CONTINUE, as suggested by Jann.
>
> RFC
> https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@xxxxxxxxxx/
>
> Lorenzo Stoakes (4):
>   mm: pagewalk: add the ability to install PTEs
>   mm: add PTE_MARKER_GUARD PTE marker
>   mm: madvise: implement lightweight guard page mechanism
>   selftests/mm: add self tests for guard page feature
>
>  arch/alpha/include/uapi/asm/mman.h       |    3 +
>  arch/mips/include/uapi/asm/mman.h        |    3 +
>  arch/parisc/include/uapi/asm/mman.h      |    3 +
>  arch/xtensa/include/uapi/asm/mman.h      |    3 +
>  include/linux/mm_inline.h                |    2 +-
>  include/linux/pagewalk.h                 |   18 +-
>  include/linux/swapops.h                  |   26 +-
>  include/uapi/asm-generic/mman-common.h   |    3 +
>  mm/hugetlb.c                             |    3 +
>  mm/internal.h                            |    6 +
>  mm/madvise.c                             |  168 ++++
>  mm/memory.c                              |   18 +-
>  mm/mprotect.c                            |    3 +-
>  mm/mseal.c                               |    1 +
>  mm/pagewalk.c                            |  200 ++--
>  tools/testing/selftests/mm/.gitignore    |    1 +
>  tools/testing/selftests/mm/Makefile      |    1 +
>  tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++
>  18 files changed, 1564 insertions(+), 66 deletions(-)
>  create mode 100644 tools/testing/selftests/mm/guard-pages.c
>
> --
> 2.46.2