Thanks very much for your report!

+cc other mmap maintainers/reviewers + Oscar (thank you for pinging me on
this Oscar!)

(You weren't to know as a bug reporter, so no problem, but the people listed
in the MEMORY MAPPING section in MAINTAINERS will always deal with anything
to do with mmap() and friends swiftly, so it's a good resource going forward
if you find anything else :)

On Wed, Feb 05, 2025 at 11:18:34PM -0700, Uday Shankar wrote:
> I was debugging an issue with a malloc implementation when I noticed
> some unintuitive behavior that happens when someone attempts to
> overwrite part of a hugepage-backed PROT_NONE mapping with another
> mapping. I've isolated the issue and reproduced it with the following
> program:
>
> [root@localhost ~]# cat test.c
> #include <assert.h>
> #include <errno.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <unistd.h>
>
> #define MMAP_FLAGS_COMMON (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB)
>
> int main()
> {
> 	size_t len = 2ULL << 30;
> 	void *a = mmap(
> 		(void *)0x7c8000000000, len, PROT_NONE,
> 		MMAP_FLAGS_COMMON | MAP_FIXED_NOREPLACE | MAP_NORESERVE, -1, 0);

The MAP_NORESERVE is key here. This tells hugetlb that you do not want to
reserve pages up front, but rather to try to allocate them at fault time, so
the mapping will succeed even if you have no hugetlb pages available. If you
then try to fault anywhere in the range, it'll segfault.

This check is done in hugetlb_reserve_pages(), which is called by
hugetlbfs_file_mmap(), an f_op->mmap() callback (MAP_HUGETLB 'cheats in' an
fd, so this is a file-backed mapping despite appearances):

	/*
	 * Only apply hugepage reservation if asked. At fault time, an
	 * attempt will be made for VM_NORESERVE to allocate a page
	 * without using reserves
	 */
	if (vm_flags & VM_NORESERVE)
		return true;

Actually though, this is irrelevant - you will see the _same_ behaviour
regardless of how you map this.

> 	printf("a=%p errno %d %m\n", a, errno);
> 	errno = 0;
>
> 	char buf[128];
> 	sprintf(buf, "cp /proc/%d/smaps smaps1", getpid());
> 	assert(system(buf) == 0);
>
> 	len = 4096;
> 	void *b = mmap(
> 		a, len, PROT_READ | PROT_WRITE,
> 		MMAP_FLAGS_COMMON | MAP_POPULATE | MAP_FIXED, -1, 0);

Here is the more interesting bit. Let me analyse what's happened here below.

> 	printf("b=%p errno %d %m\n", b, errno);
> 	errno = 0;
>
> 	sprintf(buf, "cp /proc/%d/smaps smaps2", getpid());
> 	assert(system(buf) == 0);
>
> 	return 0;
> }
> [root@localhost ~]# gcc -o test test.c && ./test
> a=0x7c8000000000 errno 0 Success
> b=0xffffffffffffffff errno 12 Cannot allocate memory
> [root@localhost ~]# diff smaps1 smaps2
> 157,158c157,158
> < 7c8000000000-7c8080000000 ---p 00000000 00:10 7332    /anon_hugepage (deleted)
> < Size:            2097152 kB
> ---
> > 7c8000200000-7c8080000000 ---p 00200000 00:10 7332    /anon_hugepage (deleted)
> > Size:            2095104 kB
>
> First, we map a 2G PROT_NONE region using hugepages. This succeeds. Then
> we try to map a 4096-length PROT_READ | PROT_WRITE region at the
> beginning of the PROT_NONE region, still using hugepages. This fails, as
> expected, because 4096 is much smaller than the hugepage size configured
> on the system (this is x86 with a default hugepage size of 2M). The
> surprising thing is the difference in /proc/pid/smaps before and after
> the failed mmap. Even though the mmap failed, the value in
> /proc/pid/smaps changed, with a 2M-sized bite being taken out of the
> front of the mapping.
> This feels unintuitive to me, as I'd expect a failed mmap to have no
> effect on the virtual memory mappings of the calling process whatsoever.
>
> I initially saw this on an ancient redhat kernel, but I was able to
> reproduce it on 6.13 as well. So I assume this behavior still exists and
> has been around forever.
>

As Oscar pointed out, this is not really hugely _surprising_, in that you may
be requesting a hugetlb mapping of 4096 bytes for case 'b', but by setting
MAP_HUGETLB you are asking for it to be aligned (and rounded up) to the huge
page size.

So in effect you are doing:

1. Perform a 2 GiB mapping (for the purposes of this discussion it actually
   doesn't matter whether you make it MAP_HUGETLB with MAP_NORESERVE to make
   that succeed or not, or even what size it is, as long as it is larger
   than the mapping you make afterwards :)

2. Map something over this _partially_, where that mapping fails.

So the crux of the point is something that Joern alluded to in his reply -
'should a failed memory mapping operation cause visible changes to the
memory mapping?'

The key point here is that the operation you are performing is _aggregate_.
You're not asking for one thing to happen, you're actually asking for two -
unmap the start of range 'a', THEN perform a new mapping in its place.

So the question boils down to 'does Linux guarantee that aggregate
operations, on partial failure, unwind those partial operations which
succeeded?' - and emphatically the answer to that is no.

If one of those successful operations is 'unmap this mapping', as it is
here, you have to ask yourself - what would undoing that look like? Because
if there's anon backing, then the data underlying that mapping is gone
forever, so we can't restore the mapping as it was.

And at this point it really makes no sense to special-case things like
PROT_NONE, because we simply do not guarantee that we do anything good in
this scenario. Attempting to 'do everything up front' would be prohibitively
expensive - imagine the overhead of tracking all of that, etc.

If you get an error performing an operation, you must interpret this as 'the
operation either partially or entirely failed' and act accordingly. This may
be surprising, but as you can imagine, correctly performing unwinding of
partial operations is... difficult, and in some cases would carry
significant performance penalties.

An important point here, especially as to Joern's question re: an unmapping
partially failing - since we now gather up and perform the unmappings for
MAP_FIXED in aggregate _internal to the kernel_, if the failure had occurred
there (it didn't, it occurred in the mapping bit) we WOULD HAVE undone the
partial operation.

The problem here is that we have SUCCESSFULLY unmapped part of a mapping,
then FAILED to map something in that gap.

So the TL;DR is - an aggregate operation failing means any or all of the
operation failed; you can no longer rely on the mapping state being what you
expected.

~~ Beyond here lie extraneous details for the curious... ~~

HOWEVER, interestingly, there is a little more to this story in this
particular case - Liam recently significantly altered how MAP_FIXED behaves
(ultimately, as it happens, to allow for far more efficient interaction with
/proc/$pid/maps and friends, but that's another story).

And he _tried_ very hard to implement the ability to partially unwind these
unmap operations, and this was part of 6.12... but in this instance there's
just nothing we can reasonably do, as discussed above.
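To make the above concrete before going through the internals - in terms of
your reproducer, the single failing call effectively decomposes into the two
steps below. This is a sketch of the observable effect only (the kernel does
all of this inside one mmap() call), and the 2M figure is simply the huge
page size your 4096-byte length gets rounded up to, as your smaps diff
shows:

	/*
	 * Step 1: clear the target range - this succeeds. With MAP_HUGETLB
	 * the 4096-byte request is rounded up to the 2M huge page size,
	 * hence the 2M bite visible in smaps.
	 */
	munmap(a, 2 * 1024 * 1024);

	/*
	 * Step 2: establish the new mapping in the gap - this is the part
	 * that fails (ENOMEM in your run), and nothing puts the old
	 * mapping back.
	 */
	void *b = mmap(a, len, PROT_READ | PROT_WRITE,
		       MMAP_FLAGS_COMMON | MAP_POPULATE | MAP_FIXED, -1, 0);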
Now - _had the failure happened on the unmap bit_ - we WOULD be able to
unwind, by simply reattaching the existing mappings (that is - internal to
the kernel - VMAs).

HOWEVER, the failure here happens _after_ the unmap SUCCESSFULLY completes.

We track the unmapping operation with a 'vms' object, and key to this is a
field 'clear_ptes' which tracks whether we still need to clear down the page
table mappings and actually execute the unmapping - if true, we haven't done
this yet; if false, we have.

We set this false after succeeding at the unmap here:

vms_complete_munmap_vmas()
-> vms_clear_ptes()
-> [ clears down any actual page table mappings ]
-> [ sets vms->clear_ptes = false ]

And then on the new mapping we do:

__mmap_region()
-> __mmap_new_vma()
-> [ error handling ]
-> vms_abort_munmap_vmas()
-> [ sees vms->clear_ptes is false ]
-> clears the range and gives up.

Looking at the code:

static void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
		struct ma_state *mas_detach)
{
	struct ma_state *mas = &vms->vmi->mas;

	if (!vms->nr_pages)
		return;

	if (vms->clear_ptes)
		return reattach_vmas(mas_detach);

^--- Here is where, if the operation had failed during the unmap, we'd be
     able to reattach the VMAs and go about our business. But this is not
     the case.

	/*
	 * Aborting cannot just call the vm_ops open() because they are often
	 * not symmetrical and state data has been lost. Resort to the old
	 * failure method of leaving a gap where the MAP_FIXED mapping failed.
	 */
	mas_set_range(mas, vms->start, vms->end - 1);

^--- This comment is the pertinent one - we have no choice but to leave the
     gap. And this is what is happening now.

	mas_store_gfp(mas, NULL, GFP_KERNEL|__GFP_NOFAIL);
	/* Clean up the insertion of the unfortunate gap */
	vms_complete_munmap_vmas(vms, mas_detach);
}

Prior to Liam's changes, there was no attempt to reattach _at all_ and we
just always added this gap. Now we attempt to reattach _if it makes sense to
do so_, but the case you've encountered, both previously and currently, will
result in the behaviour you've observed.

Hopefully this clarifies things somewhat... :) it's a (surprisingly) tricky
area when it comes to corner-case stuff like this.
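As a final aside, purely for the userspace side: given the above, the only
safe reading of a failed MAP_FIXED mmap() over an existing mapping is 'the
state of that range is now unknown'. If your malloc implementation needs to
keep the address range reserved, one option (sketched below against your
reproducer's variables, so illustrative rather than prescriptive) is to drop
a plain PROT_NONE placeholder back over the affected range - note this only
restores the address space reservation, as an ordinary mapping rather than a
hugetlb one, and the 2M size assumes the default x86 huge page size from
your report:

	if (b == MAP_FAILED) {
		/*
		 * The kernel may already have removed part of the old
		 * mapping (rounded to the huge page size), so re-reserve
		 * that range with an ordinary PROT_NONE mapping so nothing
		 * else can land in the hole. The old hugetlb backing is
		 * NOT restored.
		 */
		size_t affected = 2UL * 1024 * 1024;
		if (mmap(a, affected, PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
			 -1, 0) == MAP_FAILED)
			perror("re-reserving failed MAP_FIXED range");
	}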