On Thu, Oct 17, 2024 at 11:47 AM Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> wrote:
> On Wed, Oct 16, 2024 at 05:07:53PM +0200, Jann Horn wrote:
> > vms_abort_munmap_vmas() is a recovery path where, on entry, some VMAs
> > have already been torn down halfway (in a way we can't undo) but are
> > still present in the maple tree.
> >
> > At this point, we *must* remove the VMAs from the VMA tree, otherwise
> > we get UAF.
> >
> > Because removing VMA tree nodes can require memory allocation, the
> > existing code has an error path which tries to handle this by
> > reattaching the VMAs; but that can't be done safely.
> >
> > A nicer way to fix it would probably be to preallocate enough maple
> > tree nodes for the removal before the point of no return, or something
> > like that; but for now, fix it the easy and kinda ugly way, by marking
> > this allocation __GFP_NOFAIL.
> >
> > Fixes: 4f87153e82c4 ("mm: change failure of MAP_FIXED to restoring the gap on failure")
> > Signed-off-by: Jann Horn <jannh@xxxxxxxxxx>
>
> I kind of question whether this is real-world achievable (yes I realise you
> included a repro, but one prodding /sys/kernel/debug bits :>) but to be
> honest at this point I think I feel a lot safer just clearing this here for
> sure. So:

I mean, there is a reason why we have __GFP_NOFAIL, and if you don't
set it, my understanding is that you *can* end up failing allocations
when the page allocator sees no other way to make progress...
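
As a rough illustration of that difference, here is a userspace toy
model (not kernel code; ToyAllocator, alloc() and the reclaim callback
are names made up for this sketch). A failable request gives up when
no page is available, while a "nofail" request keeps retrying until
something frees memory:

```python
class ToyAllocator:
    """Userspace toy model; this is NOT the kernel page allocator."""

    def __init__(self, free_pages):
        self.free_pages = free_pages
        self.retries = 0

    def alloc(self, nofail=False, reclaim=None):
        """Take one page. Returns True on success, False on failure.

        With nofail=True this never returns False; it loops until a
        page shows up (and would loop forever if none ever does).
        """
        while self.free_pages == 0:
            if not nofail:
                return False  # failable request: report the failure
            # nofail request: retry, giving reclaim a chance to free pages
            self.retries += 1
            if reclaim is not None:
                self.free_pages += reclaim()
        self.free_pages -= 1
        return True
```

In this model the caller of a failable request needs a working error
path, which is exactly what the recovery path described above cannot
provide.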

I think as a rough sketch, what you'd have to do to hit this issue
without cheating using fault injection might be something like this
(for simplicity, assume all of this happens on the same CPU core):

 - make processes A, B, C, D; with A having threads A1 and A2
 - let process A consume most of the available RAM+swap (so that
   process A will be killed first by the OOM killer)
 - let thread A2 enter some syscall that will allocate a lot of
   order-0 pages without fatal_signal_pending() checks, then
   block/preempt it somehow
 - let thread A1 enter an mmap() syscall, then block/preempt it somehow
 - let process B consume remaining available RAM, until B blocks and
   the OOM killer decides to reap process A. Note that the OOM killer
   starts by basically just setting a flag on the target process and
   sending it a fatal signal; only if the target process doesn't exit
   for some time after that (OOM_REAPER_DELAY = 2 seconds), the OOM
   killer starts actively reaping the target's memory
 - let process C allocate as many maple tree nodes as possible (to
   drain the slab cache's freelists), until C blocks on memory
   allocation
 - maybe let process D free one maple tree node or such, so that the
   first maple node allocation in mmap() for constructing the detached
   tree works?
 - let thread A2 continue - it will have access to ALLOC_OOM memory
   reserves, and AFAIU will be able to completely empty out the memory
   reserves, and will then hit a GFP_KERNEL allocation failure
 - once A2 has hit an allocation failure, let thread A1 continue
   execution - it, too, should hit a GFP_KERNEL allocation failure

But I haven't actually tested that.
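
The reserve-draining part of the steps above can be sketched as a
userspace toy model. None of this is kernel code: ToyPool, drain() and
the page counts are hypothetical, and the model simply assumes (per
the AFAIU above) that an OOM victim may take every reserved page,
while ordinary tasks must stop at the reserve.

```python
class ToyPool:
    """Toy page pool with a reserve that only OOM victims may use."""

    def __init__(self, free_pages, reserve_pages):
        self.free_pages = free_pages
        self.reserve_pages = reserve_pages

    def alloc(self, oom_victim=False):
        # Assumption of this sketch: victims may drain the pool to zero,
        # everyone else fails once only the reserve is left.
        floor = 0 if oom_victim else self.reserve_pages
        if self.free_pages <= floor:
            return False  # models a failable allocation failing
        self.free_pages -= 1
        return True


def drain(pool, oom_victim):
    """Allocate until the first failure; return how many pages were taken."""
    taken = 0
    while pool.alloc(oom_victim=oom_victim):
        taken += 1
    return taken


pool = ToyPool(free_pages=10, reserve_pages=4)
taken_by_others = drain(pool, oom_victim=False)  # B/C eat all non-reserved memory
taken_by_a2 = drain(pool, oom_victim=True)       # A2 empties the reserves
a1_succeeds = pool.alloc(oom_victim=True)        # A1 runs afterwards
```

With these made-up numbers, A2's loop ends with its own allocation
failure, and A1's very next allocation fails too, which is the state
the mmap() path would have to be in for the bug to trigger.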