Re: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings

Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> · Wed, 19 Feb 2025 09:15:00 +0000

On Wed, Feb 19, 2025 at 12:35:12AM -0800, Kalesh Singh wrote:
> On Wed, Feb 19, 2025 at 12:25 AM Kalesh Singh <kaleshsingh@xxxxxxxxxx> wrote:
> >
> > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> > <lorenzo.stoakes@xxxxxxxxxx> wrote:
> > >
> > > The guard regions feature was initially implemented to support anonymous
> > > mappings only, excluding shmem.
> > >
> > > This was done so as to introduce the feature carefully and incrementally
> > > and to be conservative when considering the various caveats and corner
> > > cases that are applicable to file-backed mappings but not to anonymous
> > > ones.
> > >
> > > Now that this feature has landed in 6.13, it is time to revisit this and to
> > > extend this functionality to file-backed and shmem mappings.
> > >
> > > In order to make this maximally useful, and since one may map file-backed
> > > mappings read-only (for instance ELF images), we also remove the
> > > restriction on read-only mappings and permit the establishment of guard
> > > regions in any non-hugetlb, non-mlock()'d mapping.
> >
> > Hi Lorenzo,
> >
> > Thank you for your work on this.
> >
> > Have we thought about how guard regions are represented in /proc/*/[s]maps?
> >
> > In the field, I've found that many applications read the ranges from
> > /proc/self/[s]maps to determine what they can access (usually related
> > to obfuscation techniques). If they don't know of the guard regions it
> > would cause them to crash; I think that we'll need similar entries to
> > PROT_NONE (---p) for these, and generally to maintain consistency
> > between the behavior and what is being said from /proc/*/[s]maps.
>
> To clarify why the applications may not be aware of their guard
> regions -- in the case of the ELF mappings these PROT_NONE (guard
> regions) would be installed by the dynamic loader; or may be inherited
> from the parent (zygote in Android's case).

I'm still really confused as to how you propose to use guard regions in
this scenario. If it's trying to 'clone' some set of mappings somehow then
guard regions are just not compatible with your use case at all.

Honestly, I'm not sure what would be... maybe you could find a way to modify
the loader to interpret PROT_NONE mappings as a cue to use guard regions
instead, if that worked?
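
A minimal sketch of the loader idea above, under stated assumptions: the
helper names (block_gap(), demo_block_middle_page()) are hypothetical and not
taken from any real loader, and the fallback value 102 for MADV_GUARD_INSTALL
is the one in the 6.13 uapi headers, defined here only for builds against
older libc headers. It tries a guard region first and falls back to PROT_NONE
when the kernel rejects the request. Note that the two results are not
equivalent: only the PROT_NONE fallback changes what /proc/$pid/maps reports.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102	/* assumption: value from the 6.13 uapi headers */
#endif

enum gap_kind { GAP_FAILED = -1, GAP_GUARD_REGION = 0, GAP_PROT_NONE = 1 };

/*
 * Hypothetical helper: make [addr, addr + len) inaccessible, preferring a
 * guard region and falling back to a PROT_NONE VMA.
 */
static enum gap_kind block_gap(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_GUARD_INSTALL) == 0)
		return GAP_GUARD_REGION;

	/*
	 * The advice was rejected, e.g. the kernel predates guard regions or
	 * does not support them for this kind of mapping. Fall back.
	 */
	if (mprotect(addr, len, PROT_NONE) == 0)
		return GAP_PROT_NONE;

	return GAP_FAILED;
}

/* Block the middle page of a three-page anonymous mapping. */
static enum gap_kind demo_block_middle_page(void)
{
	long page = sysconf(_SC_PAGESIZE);
	enum gap_kind kind;
	char *p;

	assert(page > 0);
	p = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return GAP_FAILED;

	kind = block_gap(p + page, page);

	/* The neighbouring pages stay accessible in either case. */
	p[0] = 1;
	p[2 * page] = 1;

	munmap(p, 3 * page);
	return kind;
}
```

Calling demo_block_middle_page() reports which mechanism ended up being used
on the running kernel.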

You can't both not have a VMA (the thing that describes a mapping with common
attributes), keeping the state in page tables instead, and somehow also have
something that describes a mapping with said page table state :)

It's a sort of non-such-beast.

Everything is a trade-off. Guard regions provide the option to retain state
regarding faulting behaviour in page tables rather than in VMAs, entirely
opt-in and at the behest of the user.

Now extended to file-backed and shmem!
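
To make the trade-off concrete, here is a minimal userspace sketch (not part
of the series). It blocks the middle page of a three-page anonymous mapping
either with mprotect(PROT_NONE) or with a guard region, then counts how many
/proc/self/maps entries cover the range. The helper names are made up for
illustration, and the fallback value 102 for MADV_GUARD_INSTALL is the one in
the 6.13 uapi headers, defined only for builds against older libc headers. On
a kernel without guard regions the madvise() call fails and the helper
returns -1.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102	/* assumption: value from the 6.13 uapi headers */
#endif

/* Count /proc/self/maps entries overlapping [start, end); -1 on error. */
static int count_maps_entries(uintptr_t start, uintptr_t end)
{
	FILE *fp = fopen("/proc/self/maps", "r");
	char *line = NULL;
	size_t cap = 0;
	int count = 0;

	if (!fp)
		return -1;

	while (getline(&line, &cap, fp) != -1) {
		unsigned long lo, hi;

		/* Each entry starts with "<start>-<end>" in hex. */
		if (sscanf(line, "%lx-%lx", &lo, &hi) != 2)
			continue;
		if (lo < end && hi > start)
			count++;
	}

	free(line);
	fclose(fp);
	return count;
}

/*
 * Map three pages, block the middle one, and return the number of maps
 * entries covering the range. use_guard selects a guard region instead of
 * PROT_NONE. Returns -1 if the blocking call fails or is unsupported.
 */
static int entries_after_blocking(int use_guard)
{
	long page = sysconf(_SC_PAGESIZE);
	int ret, count;
	char *p;

	assert(page > 0);
	p = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	if (use_guard)
		ret = madvise(p + page, page, MADV_GUARD_INSTALL);
	else
		ret = mprotect(p + page, page, PROT_NONE);

	count = ret ? -1 : count_maps_entries((uintptr_t)p,
					      (uintptr_t)p + 3 * page);
	munmap(p, 3 * page);
	return count;
}
```

With PROT_NONE (entries_after_blocking(0)) the VMA is split, so three entries
cover the range and the middle one reads ---p. With a guard region
(entries_after_blocking(1)) the state lives only in the page tables, so the
range should still be covered by a single entry.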

I had hoped this would help your scenario, based on your talk at LPC.
Regardless, this is very clearly an important change for guard regions (most
notably for shmem): it removes an important limitation and opens up the
possibility of using them for things such as ELF mappings, for those who
don't also require specific [s]maps state (which explicitly NEEDS VMA
metadata state).

I suspect you can find a way of making this work in your scenario
though.

An alternative if you're willing to wait for page table state to be
acquired is to expose a new /proc/$pid/... endpoint that explicitly outputs
guard regions.

Would this solve your problem...? I had always considered implementing this
if there was a demand.

Maybe /proc/$pid/gmaps? :P 'g' could mean guard regions or... ;)

>
> >
> > -- Kalesh
> >
> > >
> > > It is safe to permit the establishment of guard regions in read-only
> > > mappings because the guard regions only reduce access to the mapping, and
> > > when removed simply reinstate the existing attributes of the underlying
> > > VMA, meaning no access violations can occur.
> > >
> > > While the change in kernel code introduced in this series is small, the
> > > majority of the effort here is spent in extending the testing to assert
> > > that the feature works correctly across numerous file-backed mapping
> > > scenarios.
> > >
> > > Every single guard region self-test performed against anonymous memory
> > > (which is relevant and not anon-only) has now been updated to also be
> > > performed against shmem and a mapping of a file in the working directory.
> > >
> > > This confirms that all cases also function correctly for file-backed guard
> > > regions.
> > >
> > > In addition a number of other tests are added for specific file-backed
> > > mapping scenarios.
> > >
> > > There are a number of other concerns that one might have with regard to
> > > guard regions, addressed below:
> > >
> > > Readahead
> > > ~~~~~~~~~
> > >
> > > Readahead is a process through which the page cache is populated on the
> > > assumption that sequential reads will occur, thus amortising I/O. Through
> > > a clever use of the PG_readahead folio flag, established during major
> > > fault and checked upon minor fault, it also provides for asynchronous I/O
> > > to occur as data is processed, reducing I/O stalls as data is faulted in.
> > >
> > > Guard regions do not alter this mechanism, which operates at the folio and
> > > fault level, but do of course prevent the faulting of folios that would
> > > otherwise be mapped.
> > >
> > > In the instance of a major fault prior to a guard region, synchronous
> > > readahead will occur including populating folios in the page cache which
> > > the guard regions will, in the case of the mapping in question, prevent
> > > access to.
> > >
> > > In addition, if PG_readahead is placed in a folio that is now inaccessible,
> > > this will prevent asynchronous readahead from occurring as it would
> > > otherwise do.
> > >
> > > However, there are mechanisms for heuristically resetting this within
> > > readahead regardless, which will 'recover' correct readahead behaviour.
> > >
> > > Readahead presumes sequential data access; the presence of a guard region
> > > clearly indicates that, at least in the guard region, no such sequential
> > > access will occur, as it cannot occur there.
> > >
> > > So this should have very little impact on any real workload. The far more
> > > important point is whether readahead causes incorrect or
> > > inappropriate mapping of ranges disallowed by the presence of guard
> > > regions - this is not the case, as readahead does not 'pre-fault' memory in
> > > this fashion.
> > >
> > > At any rate, any mechanism which would attempt to do so would hit the usual
> > > page fault paths, which correctly handle PTE markers as with anonymous
> > > mappings.
> > >
> > > Fault-Around
> > > ~~~~~~~~~~~~
> > >
> > > The fault-around logic, in a similar vein to readahead, attempts to improve
> > > efficiency with regard to file-backed memory mappings, however it differs
> > > in that it does not try to fetch folios into the page cache that are about
> > > to be accessed, but rather pre-maps a range of folios around the faulting
> > > address.
> > >
> > > Guard regions making use of PTE markers makes this relatively trivial, as
> > > this case is already handled - see filemap_map_folio_range() and
> > > filemap_map_order0_folio() - in both instances, the solution is to simply
> > > keep the established page table mappings and let the fault handler take
> > > care of PTE markers, as per the comment:
> > >
> > >         /*
> > >          * NOTE: If there're PTE markers, we'll leave them to be
> > >          * handled in the specific fault path, and it'll prohibit
> > >          * the fault-around logic.
> > >          */
> > >
> > > This works, as establishing guard regions results in page table mappings
> > > with PTE markers, and clearing them removes them.
> > >
> > > Truncation
> > > ~~~~~~~~~~
> > >
> > > File truncation will not eliminate existing guard regions, as the
> > > truncation operation will ultimately zap the range via
> > > unmap_mapping_range(), which specifically excludes PTE markers.
> > >
> > > Zapping
> > > ~~~~~~~
> > >
> > > Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(),
> > > which specifically deals with guard entries, leaving them intact except in
> > > instances such as process teardown or munmap() where they need to be
> > > removed.
> > >
> > > Reclaim
> > > ~~~~~~~
> > >
> > > When reclaim is performed on file-backed folios, it ultimately invokes
> > > try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte()
> > > will ultimately abort the operation for the guard region mapping. If large,
> > > then check_pte() will determine that this is a non-device private
> > > entry/device-exclusive entry 'swap' PTE and thus abort the operation in
> > > that instance.
> > >
> > > Therefore, no odd things happen in the instance of reclaim being attempted
> > > upon a file-backed guard region.
> > >
> > > Hole Punching
> > > ~~~~~~~~~~~~~
> > >
> > > This updates the page cache and ultimately invokes unmap_mapping_range(),
> > > which explicitly leaves PTE markers in place.
> > >
> > > Because the establishment of guard regions zapped any existing mappings to
> > > file-backed folios, once the guard regions are removed then the
> > > hole-punched region will be faulted in as usual and everything will behave
> > > as expected.
> > >
> > > Lorenzo Stoakes (4):
> > >   mm: allow guard regions in file-backed and read-only mappings
> > >   selftests/mm: rename guard-pages to guard-regions
> > >   tools/selftests: expand all guard region tests to file-backed
> > >   tools/selftests: add file/shmem-backed mapping guard region tests
> > >
> > >  mm/madvise.c                                  |   8 +-
> > >  tools/testing/selftests/mm/.gitignore         |   2 +-
> > >  tools/testing/selftests/mm/Makefile           |   2 +-
> > >  .../mm/{guard-pages.c => guard-regions.c}     | 921 ++++++++++++++++--
> > >  4 files changed, 821 insertions(+), 112 deletions(-)
> > >  rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%)
> > >
> > > --
> > > 2.48.1