On Mon, Jan 31, 2022 at 5:35 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Mon, Jan 31, 2022 at 10:26:11AM -0600, Eric W. Biederman wrote:
> > Matthew Wilcox <willy@xxxxxxxxxxxxx> writes:
> >
> > > On Mon, Jan 31, 2022 at 10:03:31AM -0600, Eric W. Biederman wrote:
> > >> "Matthew Wilcox (Oracle)" <willy@xxxxxxxxxxxxx> writes:
> > >>
> > >> > I'm not sure if the VMA list can change under us, but dump_vma_snapshot()
> > >> > is very careful to take the mmap_lock in write mode. We only need to
> > >> > take it in read mode here as we do not care if the size of the stack
> > >> > VMA changes underneath us.
> > >> >
> > >> > If it can be changed underneath us, this is a potential use-after-free
> > >> > for a multithreaded process which is dumping core.
> > >>
> > >> The problem is not multi-threaded process so much as processes that
> > >> share their mm.
> > >
> > > I don't understand the difference. I appreciate that another process can
> > > get read access to an mm through, eg, /proc, but how can another process
> > > (that isn't a thread of this process) modify the VMAs?
> >
> > There are a couple of ways.
> >
> > A classic way is a multi-threads process can call vfork, and the
> > mm_struct is shared with the child until exec is called.
>
> While true, I thought the semantics of vfork() were that the parent
> was suspended. Given that, it can't core dump until the child execs
> ... right?
>
> > A process can do this more deliberately by forking a child using
> > clone(CLONE_VM) and not including CLONE_THREAD. Supporting this case
> > is a hold over from before CLONE_THREAD was supported in the kernel and
> > such processes were used to simulate threads.

The syscall clone() is kind of the generalized version of fork() and
vfork(); it lets you create fun combinations of these things (some of
which might be useful, others of which make little sense), and e.g.
vfork() is basically just clone() with CLONE_VM (for sharing the
address space) plus CLONE_VFORK (block until the child exits or execs)
plus SIGCHLD (the child should send SIGCHLD to the parent when it
terminates). (Some combinations are disallowed, but IIRC you can make
things like threads with separate FD tables, or processes that share
virtual memory and signal handler tables but aren't actual threads.)

Note that until commit 0258b5fd7c71 ("coredump: Limit coredumps to a
single thread group", first in 5.16), coredumps would always tear down
every process that shares the MM before dumping, and the coredumping
code was trying to rely on that to protect against concurrency. (Which
at some point didn't actually work anymore, see below...)

> That is a multithreaded process then! Maybe not in the strict POSIX
> compliance sense, but the intent is to be a multithreaded process.
> ie multiple threads of execution, sharing an address space.

current_is_single_threaded() agrees with you. :P

> > It also happens that there are subsystems in the kernel that do things
> > like kthread_use_mm that can also be modifying the mm during a coredump.
>
> Yikes. That's terrifying. It's really legitimate for a kthread to
> attach to a process and start tearing down VMAs?

I don't know anything about kthreads doing this, but a fun way to
remotely mess with VMAs is userfaultfd. See
https://bugs.chromium.org/p/project-zero/issues/detail?id=1790
("Issue 1790: Linux: missing locking between ELF coredump code and
userfaultfd VMA modification") for the long version - but the gist is
that userfaultfd can set/clear flags on virtual address ranges
(implemented as flags on VMAs), and that may involve splitting VMAs
(when the boundaries of the specified range don't correspond to
existing VMA boundaries) or merging VMAs (when the flags of adjacent
VMAs become equal). And userfaultfd can by design be used remotely on
another process (so long as that process first created the userfaultfd
and then handed it over).
That's why I added that locking in the coredumping code.

> > > Uhh .. that seems like it needs a lot more understanding of binfmt_elf
> > > than I currently possess. I'd rather spend my time working on folios
> > > than learning much more about binfmt_elf. I was just trying to fix an
> > > assertion failure with the maple tree patches (we now assert that you're
> > > holding a lock when walking the list of VMAs).
> >
> > Fair enough. I will put it on my list of things to address.
>
> Thanks. Now that I've disclosed it's a UAF, I hope you're able to
> get to it soon. Otherwise we should put this band-aid in for now
> and you can address it properly in the fullness of time.