On Mon, Jan 31, 2022 at 5:35 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Mon, Jan 31, 2022 at 10:26:11AM -0600, Eric W. Biederman wrote:
> > Matthew Wilcox <willy@xxxxxxxxxxxxx> writes:
> >
> > > On Mon, Jan 31, 2022 at 10:03:31AM -0600, Eric W. Biederman wrote:
> > >> "Matthew Wilcox (Oracle)" <willy@xxxxxxxxxxxxx> writes:
> > >>
> > >> > I'm not sure if the VMA list can change under us, but dump_vma_snapshot()
> > >> > is very careful to take the mmap_lock in write mode. We only need to
> > >> > take it in read mode here as we do not care if the size of the stack
> > >> > VMA changes underneath us.
> > >> >
> > >> > If it can be changed underneath us, this is a potential use-after-free
> > >> > for a multithreaded process which is dumping core.
> > >>
> > >> The problem is not multi-threaded process so much as processes that
> > >> share their mm.
> > >
> > > I don't understand the difference. I appreciate that another process can
> > > get read access to an mm through, eg, /proc, but how can another process
> > > (that isn't a thread of this process) modify the VMAs?
> >
> > There are a couple of ways.
> >
> > A classic way is a multi-threads process can call vfork, and the
> > mm_struct is shared with the child until exec is called.
>
> While true, I thought the semantics of vfork() were that the parent
> was suspended. Given that, it can't core dump until the child execs
> ... right?
>
> > A process can do this more deliberately by forking a child using
> > clone(CLONE_VM) and not including CLONE_THREAD. Supporting this case
> > is a hold over from before CLONE_THREAD was supported in the kernel and
> > such processes were used to simulate threads.

The syscall clone() is kind of the generalized version of fork() and
vfork(); it lets you create fun combinations of these things (some of
which might be useful, others of which make little sense), and e.g.
vfork() is basically just clone() with CLONE_VM (for sharing the
address space) plus CLONE_VFORK (block until the child exits or execs)
plus SIGCHLD (the child should send SIGCHLD to the parent when it
terminates). (Some combinations are disallowed, but IIRC you can make
things like threads with separate FD tables, or processes that share
virtual memory and signal handler tables but aren't actual threads.)

Note that until commit 0258b5fd7c71 ("coredump: Limit coredumps to a
single thread group", first in 5.16), coredumps would always tear down
every process that shares the MM before dumping, and the coredumping
code was trying to rely on that to protect against concurrency. (Which
at some point didn't actually work anymore, see below...)

> That is a multithreaded process then! Maybe not in the strict POSIX
> compliance sense, but the intent is to be a multithreaded process.
> ie multiple threads of execution, sharing an address space.

current_is_single_threaded() agrees with you. :P

> > It also happens that there are subsystems in the kernel that do things
> > like kthread_use_mm that can also be modifying the mm during a coredump.
>
> Yikes. That's terrifying. It's really legitimate for a kthread to
> attach to a process and start tearing down VMAs?

I don't know anything about kthreads doing this, but a fun way to
remotely mess with VMAs is userfaultfd. See
https://bugs.chromium.org/p/project-zero/issues/detail?id=1790
("Issue 1790: Linux: missing locking between ELF coredump code and
userfaultfd VMA modification") for the long version - but the gist is
that userfaultfd can set/clear flags on virtual address ranges
(implemented as flags on VMAs), and that may involve splitting VMAs
(when the boundaries of the specified range don't correspond to
existing VMA boundaries) or merging VMAs (when the flags of adjacent
VMAs become equal). And userfaultfd can by design be used remotely on
another process (so long as that process first created the userfaultfd
and then handed it over).
That's why I added that locking in the coredumping code.

> > > Uhh .. that seems like it needs a lot more understanding of binfmt_elf
> > > than I currently possess. I'd rather spend my time working on folios
> > > than learning much more about binfmt_elf. I was just trying to fix an
> > > assertion failure with the maple tree patches (we now assert that you're
> > > holding a lock when walking the list of VMAs).
> >
> > Fair enough. I will put it on my list of things to address.
>
> Thanks. Now that I've disclosed it's a UAF, I hope you're able to
> get to it soon. Otherwise we should put this band-aid in for now
> and you can address it properly in the fullness of time.