On Wed, Dec 02, 2020 at 12:17:31PM +0100, Peter Zijlstra wrote:
> So the obvious 'improvement' here would be something like:
>
> 	for_each_online_cpu(cpu) {
> 		p = rcu_dereference(cpu_rq(cpu)->curr);
> 		if (p->active_mm != mm)
> 			continue;
> 		__cpumask_set_cpu(cpu, tmpmask);
> 	}
> 	on_each_cpu_mask(tmpmask, ...);
>
> The remote CPU will never switch _to_ @mm, on account of it being quite
> dead, but it is quite prone to false negatives.
>
> Consider that __schedule() sets rq->curr *before* context_switch(); this
> means we'll see next->active_mm, even though prev->active_mm might still
> be our @mm.
>
> Now, because we'll be removing the atomic ops from context_switch()'s
> active_mm swizzling, I think we can change this to something like the
> below. The hope being that the cost of the new barrier can be offset by
> the loss of the atomics.
>
> Hmm ?
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 41404afb7f4c..2597c5c0ccb0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4509,7 +4509,6 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  	if (!next->mm) {                                // to kernel
>  		enter_lazy_tlb(prev->active_mm, next);
>  
> -		next->active_mm = prev->active_mm;
>  		if (prev->mm)                           // from user
>  			mmgrab(prev->active_mm);
>  		else
> @@ -4524,6 +4523,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  		 * case 'prev->active_mm == next->mm' through
>  		 * finish_task_switch()'s mmdrop().
>  		 */
> +		next->active_mm = next->mm;
>  		switch_mm_irqs_off(prev->active_mm, next->mm, next);

I think that next->active_mm store should be after switch_mm(), otherwise
we still race.

>
>  		if (!prev->mm) {                        // from kernel
> @@ -5713,11 +5713,9 @@ static void __sched notrace __schedule(bool preempt)
>  
>  	if (likely(prev != next)) {
>  		rq->nr_switches++;
> -		/*
> -		 * RCU users of rcu_dereference(rq->curr) may not see
> -		 * changes to task_struct made by pick_next_task().
> -		 */
> -		RCU_INIT_POINTER(rq->curr, next);
> +
> +		next->active_mm = prev->active_mm;
> +		rcu_assign_pointer(rq->curr, next);
>  		/*
>  		 * The membarrier system call requires each architecture
>  		 * to have a full memory barrier after updating
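
IOW, for the context_switch() hunk above, something like this (an
untested sketch; only the placement of the next->active_mm store
changes, everything else is as in the diff):

		/*
		 * Keep reporting @mm through rq->curr->active_mm until
		 * this CPU has actually switched away from it; a remote
		 * scan of rq->curr->active_mm that sees the new value too
		 * early would skip the IPI for a CPU still using @mm.
		 */
		switch_mm_irqs_off(prev->active_mm, next->mm, next);
		next->active_mm = next->mm;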