On Fri, Jul 2, 2021, at 4:51 AM, Jann Horn wrote:
> On Fri, Jul 2, 2021 at 8:25 AM Andrei Vagin <avagin@xxxxxxxxx> wrote:
> > On Mon, Jun 28, 2021 at 06:13:29PM +0200, Jann Horn wrote:
> > > On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <avagin@xxxxxxxxx> wrote:
> > > > +static void swap_mm(struct mm_struct *prev_mm, struct mm_struct *target_mm)
> > > > +{
> > > > +	struct task_struct *tsk = current;
> > > > +	struct mm_struct *active_mm;
> > > > +
> > > > +	task_lock(tsk);
> > > > +	/* Hold off tlb flush IPIs while switching mm's */
> > > > +	local_irq_disable();
> > > > +
> > > > +	sync_mm_rss(prev_mm);
> > > > +
> > > > +	vmacache_flush(tsk);
> > > > +
> > > > +	active_mm = tsk->active_mm;
> > > > +	if (active_mm != target_mm) {
> > > > +		mmgrab(target_mm);
> > > > +		tsk->active_mm = target_mm;
> > > > +	}
> > > > +	tsk->mm = target_mm;
> > >
> > > I'm pretty sure you're not currently allowed to overwrite the ->mm
> > > pointer of a userspace thread. For example, zap_threads() assumes that
> > > all threads running under a process have the same ->mm. (And if you're
> > > fiddling with ->mm stuff, you should probably CC linux-mm@.)
> > >
> > > As far as I understand, only kthreads are allowed to do this (as
> > > implemented in kthread_use_mm()).
> > >
> > kthread_use_mm() was renamed from use_mm in the v5.8 kernel. Before
> > that, it wasn't used for user processes in the kernel, but it was
> > exported for modules, and we used it without any visible problems. We
> > understood that there could be some issues like zap_threads and it was
> > one of reasons why we decided to introduce this system call.
> >
> > I understand that there are no places in the kernel where we change mm
> > of user threads back and forth, but are there any real concerns why we
> > should not do that? I agree that zap_threads should be fixed, but it
> > will be the easy one.
>
> My point is that if you break a preexisting assumption like this,
> you'll have to go through the kernel and search for places that rely
> on this assumption, and fix them up, which may potentially require
> thinking about what kinds of semantics would actually be appropriate
> there. Like the MCE killing logic (collect_procs_anon() and such). And
> current_is_single_threaded(), in which the current patch probably
> leads to logic security bugs. And __uprobe_perf_filter(). Before my
> refactoring of the ELF coredump logic in kernel 5.10 (commit
> b2767d97f5ff75 and the ones before it), you'd have also probably
> created memory corruption bugs in races between elf_core_dump() and
> syscalls like mmap()/munmap(). (Note that this is not necessarily an
> exhaustive list.)
>

There’s nmi_uaccess_okay(), and its callers assume that, when a task is
perf tracing itself, an event on that task with nmi_uaccess_okay() means
that uaccess will access that task’s memory.

Core dump code probably expects that dumping memory will access the
correct mm.

I cannot fathom why any kind of remote vm access touched FPU state at
all. What PKRU value is supposed to be used when doing mm swap
shenanigans? How about PASID?

What happens if one task attempts to issue a KVM ioctl while its mm is
swapped?