On Fri, Oct 18, 2024 at 05:17:56PM +0100, Lorenzo Stoakes wrote:
> On Fri, Oct 18, 2024 at 06:10:37PM +0200, Vlastimil Babka wrote:
> > +CC linux-api (should also be on future revisions)
> >
>
> They're cc'd :) assuming Linux API <linux-api@xxxxxxxxxxxxxxx> is correct
> right?

As discussed on IRC: no, I was being a little slow here and hadn't realised
you'd added them, apologies! Will add them on future respins, sorry guys :)

>
> > On 10/17/24 22:42, Lorenzo Stoakes wrote:
> > > Userland library functions such as allocators and threading implementations
> > > often require regions of memory to act as 'guard pages' - mappings which,
> > > when accessed, result in a fatal signal being sent to the accessing
> > > process.
> > >
> > > The current means by which these are implemented is via a PROT_NONE mmap()
> > > mapping, which provides the required semantics but incurs the overhead of
> > > a VMA for each such region.
> > >
> > > With a great many processes and threads, this can rapidly add up and incur
> > > a significant memory penalty. It also has the added problem of preventing
> > > merges that might otherwise be permitted.
> > >
> > > This series takes a different approach - an idea suggested by Vlastimil
> > > Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
> > > provenance becomes a little tricky to ascertain after this - please forgive
> > > any omissions!) - rather than locating the guard pages at the VMA layer,
> > > instead placing them in page tables mapping the required ranges.
> > >
> > > Early testing of the prototype version of this code suggests a 5 times
> > > speed up in memory mapping invocations (in conjunction with use of
> > > process_madvise()) and a 13% reduction in VMAs on an entirely idle Android
> > > system and unoptimised code.
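
For reference, a minimal sketch of the existing PROT_NONE approach the cover
letter describes, using only standard mmap()/mprotect() calls. The helper names
map_with_guard() and access_is_fatal() are made up for illustration and are not
part of the series. The mprotect() call is what splits the guard page off into
its own VMA, which is the per-region overhead being discussed.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `usable` bytes (rounded up to whole pages)
 * followed by a single PROT_NONE guard page. Returns the base address or
 * NULL on failure; *guard_out receives the address of the guard page.
 */
static void *map_with_guard(size_t usable, void **guard_out)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = ((usable + page - 1) / page) * page;
	char *base;

	assert(guard_out != NULL);

	base = mmap(NULL, len + page, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return NULL;

	/* Changing the protection of the tail page gives it a separate VMA. */
	if (mprotect(base + len, page, PROT_NONE) != 0) {
		munmap(base, len + page);
		return NULL;
	}

	*guard_out = base + len;
	return base;
}

/*
 * Hypothetical probe: write to addr in a forked child so the parent
 * survives. Returns 1 if the child died with SIGSEGV, 0 if the write
 * succeeded, -1 on fork/wait failure.
 */
static int access_is_fatal(volatile char *addr)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		*addr = 1;
		_exit(0);
	}
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}
```

Calling map_with_guard(100, &guard) and then access_is_fatal(guard) should
report a fatal access, while the same probe on the returned base should not.
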
> > >
> > > We expect with optimisation and a loaded system with a larger number of
> > > guard pages this could significantly increase, but in any case these
> > > numbers are encouraging.
> > >
> > > This way, rather than having separate VMAs specifying which parts of a
> > > range are guard pages, instead we have a VMA spanning the entire range of
> > > memory a user is permitted to access and including ranges which are to be
> > > 'guarded'.
> > >
> > > After mapping this, a user can specify which parts of the range should
> > > result in a fatal signal when accessed.
> > >
> > > By restricting the ability to specify guard pages to memory mapped by
> > > existing VMAs, we can rely on the mappings being torn down when the
> > > mappings are ultimately unmapped and everything works simply as if the
> > > memory were not faulted in, from the point of view of the containing VMAs.
> > >
> > > This mechanism in effect poisons memory ranges similarly to hardware memory
> > > poisoning, only it is an entirely software-controlled form of poisoning.
> > >
> > > Any poisoned region of memory can also be 'unpoisoned', that is, have
> > > its poison markers removed.
> > >
> > > The mechanism is implemented via two madvise() behaviours -
> > > MADV_GUARD_POISON, which simply poisons ranges, and MADV_GUARD_UNPOISON,
> > > which clears this poisoning.
> > >
> > > Poisoning can be performed across multiple VMAs and any existing mappings
> > > will be cleared, that is zapped, before installing the poisoned page table
> > > mappings.
> > >
> > > There is no concept of 'nested' poisoning, multiple attempts to poison a
> > > range will, after the first poisoning, have no effect.
> > >
> > > Importantly, unpoisoning of poisoned ranges has no effect on non-poisoned
> > > memory, so a user can safely unpoison a range of memory and clear only
> > > poison page table mappings leaving the rest intact.
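
To make the proposed usage concrete, here is a rough sketch of how userland
might call the interface described above. It rests on an assumption:
MADV_GUARD_POISON and MADV_GUARD_UNPOISON are taken from UAPI headers built
from a tree with this series applied, and no numeric values are guessed. On
headers without them the wrapper simply fails with ENOSYS. The names
guard_set() and map_with_guard_marker() are invented for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical wrapper, not part of the series.
 * ASSUMPTION: MADV_GUARD_POISON / MADV_GUARD_UNPOISON are only available
 * when building against headers that include this series. Otherwise this
 * returns -1 with errno set to ENOSYS.
 */
static int guard_set(void *addr, size_t len, int poison)
{
#if defined(MADV_GUARD_POISON) && defined(MADV_GUARD_UNPOISON)
	return madvise(addr, len,
		       poison ? MADV_GUARD_POISON : MADV_GUARD_UNPOISON);
#else
	(void)addr;
	(void)len;
	(void)poison;
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * Hypothetical helper: map `pages` pages with one mmap() call and try to
 * place a guard marker on the last page. Unlike mprotect(PROT_NONE), the
 * intent is that no extra VMA is created for the guard. Returns the base
 * address or NULL; *guarded is 1 if the marker was installed, else 0.
 */
static char *map_with_guard_marker(size_t pages, int *guarded)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	char *base;

	assert(guarded != NULL);
	if (pages == 0)
		return NULL;

	base = mmap(NULL, pages * page, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return NULL;

	*guarded = guard_set(base + (pages - 1) * page, page, 1) == 0;
	return base;
}
```

If the marker is installed, a later guard_set(addr, len, 0) over the same
range would be the matching unpoison step. Per the description above, it
should leave any non-poisoned pages in that range untouched.
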
> > >
> > > The actual mechanism by which the page table entries are specified makes
> > > use of existing logic - PTE markers, which are used for the userfaultfd
> > > UFFDIO_POISON mechanism.
> > >
> > > Unfortunately PTE_MARKER_POISONED is not suited for the guard page
> > > mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
> > > handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
> > > logic to handle it.
> > >
> > > We also extend the generic page walk mechanism to allow for installation of
> > > PTEs (carefully restricted to memory management logic only to prevent
> > > unwanted abuse).
> > >
> > > We ensure that zapping performed by, for instance, MADV_DONTNEED, does not
> > > remove guard poison markers, nor does forking (except when VM_WIPEONFORK is
> > > specified for a VMA which implies a total removal of memory
> > > characteristics).
> > >
> > > It's important to note that the guard page implementation is emphatically
> > > NOT a security feature, so a user can remove the poisoning if they wish. We
> > > simply implement it in such a way as to provide the least surprising
> > > behaviour.
> > >
> > > An extensive set of self-tests is provided which ensures behaviour is as
> > > expected and additionally self-documents expected behaviour of poisoned
> > > ranges.
> > >
> > > Suggested-by: Vlastimil Babka <vbabka@xxxxxxx>
> >
> > Please fix the domain typo (also in patch 3 :)
> >
>
> Damnnn it! I can't believe I left that in. Sorry about that! Will fix on
> respin.
>
> Hopefully not to suse.cs ;)
>
> > Thanks for implementing this,
> > Vlastimil
>
> Thanks!
>
> >
> > > Suggested-by: Jann Horn <jannh@xxxxxxxxxx>
> > > Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
> > >
> > > v1
> > > * Un-RFC'd as there appear to be no major objections to the approach, but
> > >   rather debate on implementation.
> > > * Fixed issue with arches which need mmu_context.h and tlbflush.h header
> > >   imports in pagewalker logic to be able to use update_mmu_cache() as
> > >   reported by the kernel test bot.
> > > * Added comments in page walker logic to clarify who can use
> > >   ops->install_pte and why as well as adding a check_ops_valid() helper
> > >   function, as suggested by Christoph.
> > > * Pass false in full parameter in pte_clear_not_present_full() as suggested
> > >   by Jann.
> > > * Stopped erroneously requiring a write lock for the poison operation as
> > >   suggested by Jann and Suren.
> > > * Moved anon_vma_prepare() to the start of madvise_guard_poison() to be
> > >   consistent with how this is used elsewhere in the kernel as suggested by
> > >   Jann.
> > > * Avoid returning -EAGAIN if we are raced on page faults, just keep looping
> > >   and duck out if a fatal signal is pending or a conditional reschedule is
> > >   needed, as suggested by Jann.
> > > * Avoid needlessly splitting huge PUDs and PMDs by specifying
> > >   ACTION_CONTINUE, as suggested by Jann.
> > >
> > > RFC
> > > https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@xxxxxxxxxx/
> > >
> > > Lorenzo Stoakes (4):
> > >   mm: pagewalk: add the ability to install PTEs
> > >   mm: add PTE_MARKER_GUARD PTE marker
> > >   mm: madvise: implement lightweight guard page mechanism
> > >   selftests/mm: add self tests for guard page feature
> > >
> > >  arch/alpha/include/uapi/asm/mman.h       |    3 +
> > >  arch/mips/include/uapi/asm/mman.h        |    3 +
> > >  arch/parisc/include/uapi/asm/mman.h      |    3 +
> > >  arch/xtensa/include/uapi/asm/mman.h      |    3 +
> > >  include/linux/mm_inline.h                |    2 +-
> > >  include/linux/pagewalk.h                 |   18 +-
> > >  include/linux/swapops.h                  |   26 +-
> > >  include/uapi/asm-generic/mman-common.h   |    3 +
> > >  mm/hugetlb.c                             |    3 +
> > >  mm/internal.h                            |    6 +
> > >  mm/madvise.c                             |  168 ++++
> > >  mm/memory.c                              |   18 +-
> > >  mm/mprotect.c                            |    3 +-
> > >  mm/mseal.c                               |    1 +
> > >  mm/pagewalk.c                            |  200 ++--
> > >  tools/testing/selftests/mm/.gitignore    |    1 +
> > >  tools/testing/selftests/mm/Makefile      |    1 +
> > >  tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++
> > >  18 files changed, 1564 insertions(+), 66 deletions(-)
> > >  create mode 100644 tools/testing/selftests/mm/guard-pages.c
> > >
> > > --
> > > 2.46.2
> >