On Wed, Mar 19, 2025 at 3:53 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 19.03.25 15:50, Alexander Mikhalitsyn wrote:
> > On Mon, Oct 28, 2024 at 02:13:26PM +0000, Lorenzo Stoakes wrote:
> >> Userland library functions such as allocators and threading
> >> implementations often require regions of memory to act as 'guard
> >> pages' - mappings which, when accessed, result in a fatal signal
> >> being sent to the accessing process.
> >>
> >> The current means by which these are implemented is a PROT_NONE
> >> mmap() mapping, which provides the required semantics but incurs
> >> the overhead of a VMA for each such region.
> >>
> >> With a great many processes and threads, this can rapidly add up
> >> and incur a significant memory penalty. It also has the added
> >> problem of preventing merges that might otherwise be permitted.
> >>
> >> This series takes a different approach - an idea suggested by
> >> Vlastimil Babka (and before him David Hildenbrand and Jann Horn -
> >> perhaps more - the provenance becomes a little tricky to ascertain
> >> after this - please forgive any omissions!) - rather than locating
> >> guard pages at the VMA layer, we instead place them in the page
> >> tables mapping the required ranges.
> >>
> >> Early testing of the prototype version of this code suggests a 5x
> >> speed-up in memory mapping invocations (in conjunction with use of
> >> process_madvise()) and a 13% reduction in VMAs on an entirely idle
> >> Android system with unoptimised code.
> >>
> >> We expect that, with optimisation and on a loaded system with a
> >> larger number of guard pages, these gains could increase
> >> significantly, but in any case the numbers are encouraging.
> >>
> >> This way, rather than having separate VMAs specifying which parts
> >> of a range are guard pages, we instead have a single VMA spanning
> >> the entire range of memory a user is permitted to access, including
> >> the ranges which are to be 'guarded'.
> >>
> >> After mapping this, a user can specify which parts of the range
> >> should result in a fatal signal when accessed.
> >>
> >> By restricting the ability to specify guard pages to memory mapped
> >> by existing VMAs, we can rely on the mappings being torn down when
> >> the mappings are ultimately unmapped, and everything works, from
> >> the point of view of the containing VMAs, simply as if the memory
> >> were never faulted in.
> >>
> >> This mechanism in effect poisons memory ranges in a manner similar
> >> to hardware memory poisoning, only it is an entirely
> >> software-controlled form of poisoning.
> >>
> >> The mechanism is implemented via madvise() behaviour -
> >> MADV_GUARD_INSTALL, which installs page table-level guard page
> >> markers, and MADV_GUARD_REMOVE, which clears them.
> >>
> >> Guard markers can be installed across multiple VMAs, and any
> >> existing mappings will be cleared, that is zapped, before the guard
> >> page markers are installed in the page tables.
> >>
> >> There is no concept of 'nested' guard markers; multiple attempts to
> >> install guard markers in a range will, after the first attempt,
> >> have no effect.
> >>
> >> Importantly, removing guard markers over a range that contains both
> >> guard markers and ordinary backed memory has no effect on anything
> >> but the guard markers (including leaving huge pages un-split), so a
> >> user can safely remove guard markers over a range of memory,
> >> leaving the rest intact.
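For illustration, a minimal sketch of the interface described above.
The fallback MADV_GUARD_* values are assumptions taken to match this
series; where available, definitions from current kernel headers
should be preferred.

#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102  /* assumed value, per this series */
#define MADV_GUARD_REMOVE  103  /* assumed value, per this series */
#endif

int main(void)
{
        long psz = sysconf(_SC_PAGESIZE);

        /* One VMA spans the whole range, guard page included. */
        char *buf = mmap(NULL, 4 * psz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                return 1;

        /* The first page now delivers a fatal signal on access... */
        if (madvise(buf, psz, MADV_GUARD_INSTALL))
                return 1;

        /* ...while the remainder of the range stays usable. */
        memset(buf + psz, 0, 3 * psz);

        /* Not a security feature: the marker can simply be removed. */
        if (madvise(buf, psz, MADV_GUARD_REMOVE))
                return 1;
        buf[0] = 1;     /* now faults in an ordinary zero page */

        return munmap(buf, 4 * psz);
}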
> >>
> >> The actual mechanism by which the page table entries are specified
> >> makes use of existing logic - PTE markers, which are used for the
> >> userfaultfd UFFDIO_POISON mechanism.
> >>
> >> Unfortunately PTE_MARKER_POISONED is not suited to the guard page
> >> mechanism, as it results in VM_FAULT_HWPOISON semantics in the
> >> fault handler, so we add our own specific PTE_MARKER_GUARD and
> >> adapt the existing logic to handle it.
> >>
> >> We also extend the generic page walk mechanism to allow for
> >> installation of PTEs (carefully restricted to memory management
> >> logic only to prevent unwanted abuse).
> >>
> >> We ensure that zapping performed by MADV_DONTNEED and MADV_FREE
> >> does not remove guard markers, nor does forking (except when
> >> VM_WIPEONFORK is specified for a VMA, which implies a total removal
> >> of memory characteristics).
> >>
> >> It's important to note that the guard page implementation is
> >> emphatically NOT a security feature, so a user can remove the
> >> markers if they wish. We simply implement it in such a way as to
> >> provide the least surprising behaviour.
> >>
> >> An extensive set of self-tests is provided which ensures behaviour
> >> is as expected and which additionally self-documents the expected
> >> behaviour of guard ranges.
> >
> > Dear Lorenzo,
> > Dear colleagues,
> >
> > sorry for raising an old thread.
> >
> > It looks like this feature is now used in glibc [1], and we have
> > noticed failures in CRIU [2] CI on Fedora Rawhide userspace. The
> > question now is how we can properly detect such "guarded" pages
> > from user space. As far as I can see from the MADV_GUARD_INSTALL
> > implementation, it does not modify the VMA flags in any way, but
> > only the page tables. This means that the /proc/<pid>/maps and
> > /proc/<pid>/smaps interfaces are useless in this case. (Please
> > correct me if I'm missing anything here.)
> >
> > I wonder if you have any ideas / suggestions regarding
> > Checkpoint/Restore here. We (CRIU devs) are happy to develop some
> > patches to bring in some uAPI to expose MADV_GUARDs, but before
> > going into this we decided to raise the question on LKML.
>
> See [1] and [2]

Hi David,

Huge thanks for such a fast and helpful reply ;)

>
> [1] https://lkml.kernel.org/r/cover.1740139449.git.lorenzo.stoakes@xxxxxxxxxx
> [2] https://lwn.net/Articles/1011366/
>
> --
> Cheers,
>
> David / dhildenb

Kind regards,
Alex
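P.S. To make the detection problem concrete, a minimal sketch (the
fallback MADV_GUARD_INSTALL value is an assumption matching the series
above; prefer definitions from current kernel headers): the
/proc/self/maps line covering the mapping comes out identical before
and after guard installation, which is exactly why the markers are
invisible to CRIU there.

#include <inttypes.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102  /* assumed value, per the series */
#endif

/* Print the /proc/self/maps line for the VMA containing addr. */
static void show_vma(void *addr)
{
        uintptr_t target = (uintptr_t)addr;
        char line[512];
        FILE *f = fopen("/proc/self/maps", "r");

        if (!f)
                return;
        while (fgets(line, sizeof(line), f)) {
                uintptr_t start, end;

                if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR, &start, &end) == 2
                    && target >= start && target < end)
                        fputs(line, stdout);
        }
        fclose(f);
}

int main(void)
{
        long psz = sysconf(_SC_PAGESIZE);
        char *buf = mmap(NULL, 4 * psz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
                return 1;

        show_vma(buf);                          /* before */
        if (madvise(buf, psz, MADV_GUARD_INSTALL))
                return 1;
        show_vma(buf);                          /* after: same output */
        return 0;
}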