Re: [PATCH fix 6.12] mm: mark mas allocation in vms_abort_munmap_vmas as __GFP_NOFAIL

Jann Horn <jannh@xxxxxxxxxx> · Thu, 17 Oct 2024 18:57:24 +0200

On Thu, Oct 17, 2024 at 11:47 AM Lorenzo Stoakes
<lorenzo.stoakes@xxxxxxxxxx> wrote:
> On Wed, Oct 16, 2024 at 05:07:53PM +0200, Jann Horn wrote:
> > vms_abort_munmap_vmas() is a recovery path where, on entry, some VMAs
> > have already been torn down halfway (in a way we can't undo) but are
> > still present in the maple tree.
> >
> > At this point, we *must* remove the VMAs from the VMA tree, otherwise
> > we get UAF.
> >
> > Because removing VMA tree nodes can require memory allocation, the
> > existing code has an error path which tries to handle this by
> > reattaching the VMAs; but that can't be done safely.
> >
> > A nicer way to fix it would probably be to preallocate enough maple
> > tree nodes for the removal before the point of no return, or something
> > like that; but for now, fix it the easy and kinda ugly way, by marking
> > this allocation __GFP_NOFAIL.
> >
> > Fixes: 4f87153e82c4 ("mm: change failure of MAP_FIXED to restoring the gap on failure")
> > Signed-off-by: Jann Horn <jannh@xxxxxxxxxx>
>
> I kind of question whether this is real-world achievable (yes I realise you
> included a repro, but one prodding /sys/kernel/debug bits :>) but to be
> honest at this point I think I feel a lot safer just clearing this here for
> sure. So:

I mean, there is a reason why we have __GFP_NOFAIL, and if you don't
set it, my understanding is that you *can* end up failing allocations
when the page allocator sees no other way to make progress...
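
To illustrate that distinction, here is a minimal userspace toy model. It is
not kernel code and does not reflect the real page allocator's logic; the
names toy_pool, toy_alloc, toy_wait and TOY_NOFAIL are all hypothetical and
exist only for this sketch. It assumes a pool that is empty now and gets one
page back after a fixed number of retries. Without the flag the caller sees a
failure and needs an error path; with the flag the call only returns once it
has succeeded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag for this toy model; not the kernel's gfp_t encoding. */
#define TOY_NOFAIL 0x1u

struct toy_pool {
	unsigned int free_pages;       /* pages available right now */
	unsigned int ticks_until_free; /* retries until someone else frees a page */
};

/* Simulate time passing: after enough retries, one page becomes free. */
static void toy_wait(struct toy_pool *pool)
{
	if (pool->ticks_until_free > 0 && --pool->ticks_until_free == 0)
		pool->free_pages++;
}

/*
 * Returns true if a page was taken from the pool.
 *
 * Without TOY_NOFAIL: a single attempt; returns false if the pool is empty.
 * With TOY_NOFAIL: retries until a page shows up and never returns false.
 * If no page will ever be freed, this loops forever, which is the price of
 * a "cannot fail" allocation in this model.
 */
static bool toy_alloc(struct toy_pool *pool, unsigned int flags)
{
	assert(pool != NULL);

	for (;;) {
		if (pool->free_pages > 0) {
			pool->free_pages--;
			return true;
		}
		if (!(flags & TOY_NOFAIL))
			return false; /* caller must have a working error path */
		toy_wait(pool);       /* no error path allowed: keep retrying */
	}
}
```

With a pool of { .free_pages = 0, .ticks_until_free = 3 }, toy_alloc(&pool, 0)
fails immediately, while toy_alloc(&pool, TOY_NOFAIL) spins through three
waits and then succeeds. A recovery path that cannot tolerate failure has to
look like the second call.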

As a rough sketch, I think hitting this issue without cheating via fault
injection might look something like this (for simplicity, assume all of
this happens on the same CPU core):

 - make processes A, B, C, D; with A having threads A1 and A2
 - let process A consume most of the available RAM+swap (so that
   process A will be killed first by the OOM killer)
 - let thread A2 enter some syscall that will allocate a lot of
   order-0 pages without fatal_signal_pending() checks, then
   block/preempt it somehow
 - let thread A1 enter an mmap() syscall, then block/preempt it somehow
 - let process B consume remaining available RAM, until B blocks and
   the OOM killer decides to reap process A. Note that the OOM killer
   starts by basically just setting a flag on the target process and
   sending it a fatal signal; only if the target process doesn't exit for
   some time after that (OOM_REAPER_DELAY = 2 seconds), the OOM killer
   starts actively reaping the target's memory
 - let process C allocate as many maple tree nodes as possible (to
   drain the slab cache's freelists), until C blocks on memory allocation
 - maybe let process D free one maple tree node or such, so that the
   first maple node allocation in mmap() for constructing the detached
   tree works?
 - let thread A2 continue - it will have access to ALLOC_OOM memory
   reserves, and AFAIU will be able to completely empty out the memory
   reserves, and will then hit a GFP_KERNEL allocation failure
 - once A2 has hit an allocation failure, let thread A1 continue
   execution - it, too, should hit a GFP_KERNEL allocation failure

But I haven't actually tested that.