Excerpts from peterz@xxxxxxxxxxxxx's message of August 21, 2020 11:04 pm: > On Fri, Aug 21, 2020 at 11:09:51AM +0530, Aneesh Kumar K.V wrote: >> Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes: >> >> > For SMP systems using IPI based TLB invalidation, looking at >> > current->active_mm is entirely reasonable. This then presents the >> > following race condition: >> > >> > >> > CPU0 CPU1 >> > >> > flush_tlb_mm(mm) use_mm(mm) >> > <send-IPI> >> > tsk->active_mm = mm; >> > <IPI> >> > if (tsk->active_mm == mm) >> > // flush TLBs >> > </IPI> >> > switch_mm(old_mm,mm,tsk); >> > >> > >> > Where it is possible the IPI flushed the TLBs for @old_mm, not @mm, >> > because the IPI lands before we actually switched. >> > >> > Avoid this by disabling IRQs across changing ->active_mm and >> > switch_mm(). >> > >> > [ There are all sorts of reasons this might be harmless for various >> > architecture specific reasons, but best not leave the door open at >> > all. ] >> >> >> Do we have similar race with exec_mmap()? I am looking at exec_mmap() >> runnning parallel to do_exit_flush_lazy_tlb(). We can get >> >> if (current->active_mm == mm) { >> >> true and if we don't disable irq around updating tsk->mm/active_mm we >> can end up doing mmdrop on wrong mm? > > exec_mmap() is called after de_thread(), there should not be any mm > specific invalidations around I think. > > Then again, CLONE_VM without CLONE_THREAD might still be possible, so > yeah, we probably want IRQs disabled there too, just for consistency and > general paranoia if nothing else. The problem is probably not this TLB flushing race, but I think there is a lazy tlb race. call_usermodehelper() kernel_execve() old_mm = current->mm; active_mm = current->active_mm; *** preempt *** ---------------------->schedule() prev->active_mm = NULL; mmdrop(prev active mm) ... <----------------------schedule() current->mm = mm; current->active_mm = mm; if (!old_mm) mmdrop(active_mm); /* double free! */ There's possibly other problematic interleavings. powerpc also has an issue with switching away a lazy tlb mm via IPI which is basically the same problem so I just illustrate the more general issue. I think we just make it a rule that these always get updated under local_irq_disable, to be safe. Trouble is we can't just do it, because some architectures can't do activate_mm with irqs disabled. ARM and UM, at least. UM can't even do preempt_disabled. We can probably change them to make them work, I'm not sure what the best way to go is, my first attempt is to require activate_mm to do the mm switching and the irq disable as well, but I'll need some help from the archs I'll send out rfcs in a minute. Thanks, Nick