On Wed, Mar 19, 2025 at 3:53 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 19.03.25 15:50, Alexander Mikhalitsyn wrote:
> > On Mon, Oct 28, 2024 at 02:13:26PM +0000, Lorenzo Stoakes wrote:
> >> Userland library functions such as allocators and threading
> >> implementations often require regions of memory to act as 'guard
> >> pages' - mappings which, when accessed, result in a fatal signal
> >> being sent to the accessing process.
> >>
> >> The current means by which these are implemented is a PROT_NONE
> >> mmap() mapping, which provides the required semantics but incurs
> >> the overhead of a VMA for each such region.
> >>
> >> With a great many processes and threads, this can rapidly add up
> >> and incur a significant memory penalty. It also has the added
> >> problem of preventing merges that might otherwise be permitted.
> >>
> >> This series takes a different approach - an idea suggested by
> >> Vlastimil Babka (and before him David Hildenbrand and Jann Horn -
> >> perhaps more - the provenance becomes a little tricky to ascertain
> >> after this - please forgive any omissions!) - rather than locating
> >> guard pages at the VMA layer, we instead place them in the page
> >> tables mapping the required ranges.
> >>
> >> Early testing of the prototype version of this code suggests a 5x
> >> speed-up in memory mapping invocations (in conjunction with use of
> >> process_madvise()) and a 13% reduction in VMAs on an entirely idle
> >> Android system with unoptimised code.
> >>
> >> We expect that, with optimisation and on a loaded system with a
> >> larger number of guard pages, these gains could increase
> >> significantly, but in any case the numbers are encouraging.
> >>
> >> This way, rather than having separate VMAs specifying which parts
> >> of a range are guard pages, we instead have a single VMA spanning
> >> the entire range of memory a user is permitted to access, including
> >> the ranges which are to be 'guarded'.
> >>
> >> After mapping this, a user can specify which parts of the range
> >> should result in a fatal signal when accessed.
> >>
> >> By restricting the ability to specify guard pages to memory mapped
> >> by existing VMAs, we can rely on the mappings being torn down when
> >> the mappings are ultimately unmapped, and everything works, from
> >> the point of view of the containing VMAs, simply as if the memory
> >> were never faulted in.
> >>
> >> This mechanism in effect poisons memory ranges in a manner similar
> >> to hardware memory poisoning, only it is an entirely
> >> software-controlled form of poisoning.
> >>
> >> The mechanism is implemented via madvise() behaviour -
> >> MADV_GUARD_INSTALL, which installs page table-level guard page
> >> markers, and MADV_GUARD_REMOVE, which clears them.
> >>
> >> Guard markers can be installed across multiple VMAs, and any
> >> existing mappings will be cleared, that is zapped, before the guard
> >> page markers are installed in the page tables.
> >>
> >> There is no concept of 'nested' guard markers; multiple attempts to
> >> install guard markers in a range will, after the first attempt,
> >> have no effect.
> >>
> >> Importantly, removing guard markers over a range that contains both
> >> guard markers and ordinary backed memory has no effect on anything
> >> but the guard markers (including leaving huge pages un-split), so a
> >> user can safely remove guard markers over a range of memory,
> >> leaving the rest intact.
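For illustration, a minimal sketch of the interface described above.
The fallback MADV_GUARD_* values are assumptions taken to match this
series; where available, definitions from current kernel headers
should be preferred.

#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102  /* assumed value, per this series */
#define MADV_GUARD_REMOVE  103  /* assumed value, per this series */
#endif

int main(void)
{
        long psz = sysconf(_SC_PAGESIZE);

        /* One VMA spans the whole range, guard page included. */
        char *buf = mmap(NULL, 4 * psz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                return 1;

        /* The first page now delivers a fatal signal on access... */
        if (madvise(buf, psz, MADV_GUARD_INSTALL))
                return 1;

        /* ...while the remainder of the range stays usable. */
        memset(buf + psz, 0, 3 * psz);

        /* Not a security feature: the marker can simply be removed. */
        if (madvise(buf, psz, MADV_GUARD_REMOVE))
                return 1;
        buf[0] = 1;     /* now faults in an ordinary zero page */

        return munmap(buf, 4 * psz);
}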
> >>
> >> The actual mechanism by which the page table entries are specified
> >> makes use of existing logic - PTE markers, which are used for the
> >> userfaultfd UFFDIO_POISON mechanism.
> >>
> >> Unfortunately PTE_MARKER_POISONED is not suited to the guard page
> >> mechanism, as it results in VM_FAULT_HWPOISON semantics in the
> >> fault handler, so we add our own specific PTE_MARKER_GUARD and
> >> adapt the existing logic to handle it.
> >>
> >> We also extend the generic page walk mechanism to allow for
> >> installation of PTEs (carefully restricted to memory management
> >> logic only to prevent unwanted abuse).
> >>
> >> We ensure that zapping performed by MADV_DONTNEED and MADV_FREE
> >> does not remove guard markers, nor does forking (except when
> >> VM_WIPEONFORK is specified for a VMA, which implies a total removal
> >> of memory characteristics).
> >>
> >> It's important to note that the guard page implementation is
> >> emphatically NOT a security feature, so a user can remove the
> >> markers if they wish. We simply implement it in such a way as to
> >> provide the least surprising behaviour.
> >>
> >> An extensive set of self-tests is provided which ensures behaviour
> >> is as expected and which additionally self-documents the expected
> >> behaviour of guard ranges.
> >
> > Dear Lorenzo,
> > Dear colleagues,
> >
> > sorry for raising an old thread.
> >
> > It looks like this feature is now used in glibc [1], and we have
> > noticed failures in CRIU [2] CI on Fedora Rawhide userspace. The
> > question now is how we can properly detect such "guarded" pages
> > from user space. As far as I can see from the MADV_GUARD_INSTALL
> > implementation, it does not modify the VMA flags in any way, but
> > only the page tables. This means that the /proc/<pid>/maps and
> > /proc/<pid>/smaps interfaces are useless in this case. (Please
> > correct me if I'm missing anything here.)
> >
> > I wonder if you have any ideas / suggestions regarding
> > Checkpoint/Restore here. We (CRIU devs) are happy to develop some
> > patches to bring in some uAPI to expose MADV_GUARDs, but before
> > going into this we decided to raise the question on LKML.
>
> See [1] and [2]

Hi David,

Huge thanks for such a fast and helpful reply ;)

>
> [1] https://lkml.kernel.org/r/cover.1740139449.git.lorenzo.stoakes@xxxxxxxxxx
> [2] https://lwn.net/Articles/1011366/
>
> --
> Cheers,
>
> David / dhildenb

Kind regards,
Alex
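P.S. To make the detection problem concrete, a minimal sketch (the
fallback MADV_GUARD_INSTALL value is an assumption matching the series
above; prefer definitions from current kernel headers): the
/proc/self/maps line covering the mapping comes out identical before
and after guard installation, which is exactly why the markers are
invisible to CRIU there.

#include <inttypes.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102  /* assumed value, per the series */
#endif

/* Print the /proc/self/maps line for the VMA containing addr. */
static void show_vma(void *addr)
{
        uintptr_t target = (uintptr_t)addr;
        char line[512];
        FILE *f = fopen("/proc/self/maps", "r");

        if (!f)
                return;
        while (fgets(line, sizeof(line), f)) {
                uintptr_t start, end;

                if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR, &start, &end) == 2
                    && target >= start && target < end)
                        fputs(line, stdout);
        }
        fclose(f);
}

int main(void)
{
        long psz = sysconf(_SC_PAGESIZE);
        char *buf = mmap(NULL, 4 * psz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
                return 1;

        show_vma(buf);                          /* before */
        if (madvise(buf, psz, MADV_GUARD_INSTALL))
                return 1;
        show_vma(buf);                          /* after: same output */
        return 0;
}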