----- On Jan 8, 2022, at 11:43 AM, Andy Lutomirski luto@xxxxxxxxxx wrote:

> membarrier() needs a barrier after any CPU changes mm. There is currently
> a comment explaining why this barrier probably exists in all cases. The
> logic is based on ensuring that the barrier exists on every control flow
> path through the scheduler. It also relies on mmgrab() and mmdrop() being
> full barriers.
>
> mmgrab() and mmdrop() would be better if they were not full barriers. As a
> trivial optimization, mmgrab() could use a relaxed atomic and mmdrop()
> could use a release on architectures that have these operations. Larger
> optimizations are also in the works. Doing any of these optimizations
> while preserving an unnecessary barrier will complicate the code and
> penalize non-membarrier-using tasks.
>
> Simplify the logic by adding an explicit barrier, and allow architectures
> to override it as an optimization if they want to.
>
> One of the deleted comments in this patch said "It is therefore
> possible to schedule between user->kernel->user threads without
> passing through switch_mm()". It is possible to do this without, say,
> writing to CR3 on x86, but the core scheduler indeed calls
> switch_mm_irqs_off() to tell the arch code to go back from lazy mode
> to no-lazy mode.
>
> The membarrier_finish_switch_mm() call in exec_mmap() is a no-op so long as
> there is no way for a newly execed program to register for membarrier prior
> to running user code. Subsequent patches will merge the exec_mmap() code
> with the kthread_use_mm() code, though, and keeping the paths consistent
> will make the result more comprehensible.

I don't agree with the approach here. Adding memory barrier overhead for the
sake of possible future optimization work is IMHO not an appropriate
justification for a performance regression.

However, I think we can allow forward progress on optimizing
mmgrab()/mmdrop() without hurting performance.

One possible approach would be to introduce
smp_mb__{before,after}_{mmgrab,mmdrop}, which would initially be no-ops.
Those could be used by membarrier without introducing any overhead, and
would allow the implicit barriers to be moved gradually from
mmgrab()/mmdrop() to smp_mb__{before,after}_{mmgrab,mmdrop} on a
per-architecture basis.
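For illustration, here is a rough and untested sketch of that idea. The
helper names come from the proposal above; the no-op defaults, the pairing
rule spelled out in the comment, and the kthread_use_mm()-style call site
are only assumptions about how this could look, not existing kernel code.

/*
 * Untested sketch.  Assumed pairing rule: mmgrab()/mmdrop() plus the
 * matching helper together order like a full smp_mb().  As long as
 * mmgrab()/mmdrop() keep their implicit barriers, the helpers stay
 * no-ops, so membarrier-related call sites pay nothing extra.  An
 * architecture that later relaxes its mmgrab()/mmdrop() would override
 * the helpers with a real barrier.
 */
#ifndef smp_mb__before_mmgrab
#define smp_mb__before_mmgrab()         do { } while (0)
#endif
#ifndef smp_mb__after_mmgrab
#define smp_mb__after_mmgrab()          do { } while (0)
#endif
#ifndef smp_mb__before_mmdrop
#define smp_mb__before_mmdrop()         do { } while (0)
#endif
#ifndef smp_mb__after_mmdrop
#define smp_mb__after_mmdrop()          do { } while (0)
#endif

/* Hypothetical membarrier-relevant call site (kthread_use_mm()-like): */
        if (active_mm != mm) {
                /*
                 * Membarrier needs a full barrier after storing to
                 * tsk->mm and before returning to user-space memory
                 * accesses; mmdrop() paired with the (initially no-op)
                 * smp_mb__after_mmdrop() would provide it.
                 */
                mmdrop(active_mm);
                smp_mb__after_mmdrop();
        } else {
                smp_mb();
        }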
Thanks,

Mathieu

>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> Cc: Nicholas Piggin <npiggin@xxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Signed-off-by: Andy Lutomirski <luto@xxxxxxxxxx>
> ---
>  fs/exec.c                |  1 +
>  include/linux/sched/mm.h | 18 ++++++++++++++++++
>  kernel/kthread.c         | 12 +-----------
>  kernel/sched/core.c      | 34 +++++++++-------------------------
>  4 files changed, 29 insertions(+), 36 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index a098c133d8d7..3abbd0294e73 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1019,6 +1019,7 @@ static int exec_mmap(struct mm_struct *mm)
>          activate_mm(active_mm, mm);
>          if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
>                  local_irq_enable();
> +        membarrier_finish_switch_mm(mm);
>          tsk->mm->vmacache_seqnum = 0;
>          vmacache_flush(tsk);
>          task_unlock(tsk);
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 0df706c099e5..e8919995d8dd 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -349,6 +349,20 @@ extern void membarrier_exec_mmap(struct mm_struct *mm);
>
>  extern void membarrier_update_current_mm(struct mm_struct *next_mm);
>
> +/*
> + * Called by the core scheduler after calling switch_mm_irqs_off().
> + * Architectures that have implicit barriers when switching mms can
> + * override this as an optimization.
> + */
> +#ifndef membarrier_finish_switch_mm
> +static inline void membarrier_finish_switch_mm(struct mm_struct *mm)
> +{
> +        if (atomic_read(&mm->membarrier_state) &
> +            (MEMBARRIER_STATE_GLOBAL_EXPEDITED | MEMBARRIER_STATE_PRIVATE_EXPEDITED))
> +                smp_mb();
> +}
> +#endif
> +
>  #else
>  static inline void membarrier_exec_mmap(struct mm_struct *mm)
>  {
> @@ -356,6 +370,10 @@ static inline void membarrier_exec_mmap(struct mm_struct *mm)
>  static inline void membarrier_update_current_mm(struct mm_struct *next_mm)
>  {
>  }
> +static inline void membarrier_finish_switch_mm(struct mm_struct *mm)
> +{
> +}
> +
>  #endif
>
>  #endif /* _LINUX_SCHED_MM_H */
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 5b37a8567168..396ae78a1a34 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -1361,25 +1361,15 @@ void kthread_use_mm(struct mm_struct *mm)
>          tsk->mm = mm;
>          membarrier_update_current_mm(mm);
>          switch_mm_irqs_off(active_mm, mm, tsk);
> +        membarrier_finish_switch_mm(mm);
>          local_irq_enable();
>          task_unlock(tsk);
>  #ifdef finish_arch_post_lock_switch
>          finish_arch_post_lock_switch();
>  #endif
>
> -        /*
> -         * When a kthread starts operating on an address space, the loop
> -         * in membarrier_{private,global}_expedited() may not observe
> -         * that tsk->mm, and not issue an IPI. Membarrier requires a
> -         * memory barrier after storing to tsk->mm, before accessing
> -         * user-space memory. A full memory barrier for membarrier
> -         * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
> -         * mmdrop(), or explicitly with smp_mb().
> -         */
>          if (active_mm != mm)
>                  mmdrop(active_mm);
> -        else
> -                smp_mb();
>
>          to_kthread(tsk)->oldfs = force_uaccess_begin();
>  }
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6a1db8264c7b..917068b0a145 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4824,14 +4824,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>          fire_sched_in_preempt_notifiers(current);
>
>          /*
> -         * When switching through a kernel thread, the loop in
> -         * membarrier_{private,global}_expedited() may have observed that
> -         * kernel thread and not issued an IPI. It is therefore possible to
> -         * schedule between user->kernel->user threads without passing though
> -         * switch_mm(). Membarrier requires a barrier after storing to
> -         * rq->curr, before returning to userspace, and mmdrop() provides
> -         * this barrier.
> -         *
>           * If an architecture needs to take a specific action for
>           * SYNC_CORE, it can do so in switch_mm_irqs_off().
>           */
> @@ -4915,15 +4907,14 @@ context_switch(struct rq *rq, struct task_struct *prev,
>                  prev->active_mm = NULL;
>          } else {                                // to user
>                  membarrier_switch_mm(rq, prev->active_mm, next->mm);
> +                switch_mm_irqs_off(prev->active_mm, next->mm, next);
> +
>                  /*
>                   * sys_membarrier() requires an smp_mb() between setting
> -                 * rq->curr / membarrier_switch_mm() and returning to userspace.
> -                 *
> -                 * The below provides this either through switch_mm(), or in
> -                 * case 'prev->active_mm == next->mm' through
> -                 * finish_task_switch()'s mmdrop().
> +                 * rq->curr->mm to a membarrier-enabled mm and returning
> +                 * to userspace.
>                   */
> -                switch_mm_irqs_off(prev->active_mm, next->mm, next);
> +                membarrier_finish_switch_mm(next->mm);
>
>                  if (!prev->mm) {                // from kernel
>                          /* will mmdrop() in finish_task_switch(). */
> @@ -6264,17 +6255,10 @@ static void __sched notrace __schedule(unsigned int sched_mode)
>                  RCU_INIT_POINTER(rq->curr, next);
>                  /*
>                   * The membarrier system call requires each architecture
> -                 * to have a full memory barrier after updating
> -                 * rq->curr, before returning to user-space.
> -                 *
> -                 * Here are the schemes providing that barrier on the
> -                 * various architectures:
> -                 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
> -                 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
> -                 * - finish_lock_switch() for weakly-ordered
> -                 *   architectures where spin_unlock is a full barrier,
> -                 * - switch_to() for arm64 (weakly-ordered, spin_unlock
> -                 *   is a RELEASE barrier),
> +                 * to have a full memory barrier before and after updating
> +                 * rq->curr->mm, before returning to userspace. This
> +                 * is provided by membarrier_finish_switch_mm(). Architectures
> +                 * that want to optimize this can override that function.
>                   */
>                  ++*switch_count;
>
> --
> 2.33.1

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com