----- On Aug 1, 2017, at 12:03 AM, Paul E. McKenney paulmck@xxxxxxxxxxxxxxxxxx wrote:

> On Tue, Aug 01, 2017 at 12:04:05AM +0000, Mathieu Desnoyers wrote:
>> ----- On Jul 31, 2017, at 12:13 PM, Paul E. McKenney paulmck@xxxxxxxxxxxxxxxxxx
>> wrote:
>>
>> > On Mon, Jul 31, 2017 at 01:50:29PM +1000, Stephen Rothwell wrote:
>> >> Hi Paul,
>> >>
>> >> Today's linux-next merge of the rcu tree got a conflict in:
>> >>
>> >>   arch/x86/mm/tlb.c
>> >>
>> >> between commit:
>> >>
>> >>   94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
>> >>
>> >> from the tip tree and commit:
>> >>
>> >>   d7713e8f8b23 ("membarrier: Expedited private command")
>> >>
>> >> from the rcu tree.
>> >>
>> >> I fixed it up (the former removed the comment and the load_cr3(), so I
>> >> just dropped the comment change in the latter) and can carry the fix as
>> >> necessary. This is now fixed as far as linux-next is concerned, but any
>> >> non-trivial conflicts should be mentioned to your upstream maintainer
>> >> when your tree is submitted for merging. You may also want to consider
>> >> cooperating with the maintainer of the conflicting tree to minimise any
>> >> particularly complex conflicts.
>> >
>> > Thank you, Stephen!
>> >
>> > Mathieu, Peter, our commit log reads as if removal of load_cr3() would
>> > simply result in relying on the ordering provided by the atomic ops
>> > in switch_mm() for mm_cpumask(), so that only the commit log and the
>> > comment need changing.
>> >
>> > Please let me know if I am missing something here.
>>
>> I think you are right. Both load_cr3() and mm_cpumask update operations
>> (LOCK prefixed) provide the appropriate barriers on x86. So it's just a
>> matter of adapting the comment to the new reality.
>
> Like this?

The updated comment in the commit message looks good, but I would be tempted
to add a comment in x86 switch_mm_irqs_off() stating the following requirement
just before the line invoking cpumask_set_cpu():

/*
 * The full memory barrier implied by mm_cpumask update operations
 * is required by the membarrier system call.
 */
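
For concreteness, the placement I have in mind is roughly the following sketch
(the surrounding lines are abridged and only illustrative; they are not meant
to match arch/x86/mm/tlb.c exactly after the tip-tree rework):

void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
			struct task_struct *tsk)
{
	unsigned int cpu = smp_processor_id();

	/* ... lazy TLB handling and CR3/ASID bookkeeping elided ... */

	/*
	 * The full memory barrier implied by mm_cpumask update operations
	 * is required by the membarrier system call.
	 */
	cpumask_set_cpu(cpu, mm_cpumask(next));

	/* ... TLB-generation tracking and the CR3 write follow ... */
}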

Thanks,

Mathieu

>
> 							Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit fde19879b6bd1abc0c1d4d5f945efed61bf7eb8c
> Author: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> Date:   Fri Jul 28 16:40:40 2017 -0400
>
> membarrier: Expedited private command
>
> Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using a cpumask built
> from all runqueues for which the current thread's mm is the same as that of
> the thread calling sys_membarrier. It executes faster than the non-expedited
> variant (no blocking). It also works on NOHZ_FULL configurations.
>
> Scheduler-wise, it requires a memory barrier before and after context
> switching between processes (which have different mm). The memory
> barrier before context switch is already present. For the barrier after
> context switch:
>
> * Our TSO archs can do RELEASE without being a full barrier. Look at
>   x86 spin_unlock() being a regular STORE for example. But for those
>   archs, all atomics imply smp_mb and all of them have atomic ops in
>   switch_mm() for mm_cpumask().
>
> * Of all weakly ordered machines, only ARM64 and PPC can do RELEASE;
>   the rest do indeed do smp_mb(), so there the spin_unlock() is a full
>   barrier and we're good.
>
> * ARM64 has a very heavy barrier in switch_to(), which suffices.
>
> * PPC just removed its barrier from switch_to(), but appears to be
>   talking about adding something to switch_mm(). So add a
>   smp_mb__after_unlock_lock() for now, until this is settled on the PPC
>   side.
>
> Changes since v3:
> - Properly document the memory barriers provided by each architecture.
>
> Changes since v2:
> - Address comments from Peter Zijlstra,
> - Add smp_mb__after_unlock_lock() after finish_lock_switch() in
>   finish_task_switch() to add the memory barrier we need after storing
>   to rq->curr. This is much simpler than the previous approach relying
>   on atomic_dec_and_test() in mmdrop(), which actually added a memory
>   barrier in the common case of switching between userspace processes.
> - Return -EINVAL when MEMBARRIER_CMD_SHARED is used on a nohz_full
>   kernel, rather than having the whole membarrier system call returning
>   -ENOSYS. Indeed, CMD_PRIVATE_EXPEDITED is compatible with nohz_full.
>   Adapt the CMD_QUERY mask accordingly.
>
> Changes since v1:
> - move membarrier code under kernel/sched/ because it uses the
>   scheduler runqueue,
> - only add the barrier when we switch from a kernel thread. The case
>   where we switch from a user-space thread is already handled by
>   the atomic_dec_and_test() in mmdrop().
> - add a comment to mmdrop() documenting the requirement on the implicit
>   memory barrier.
>
> CC: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> CC: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> CC: Boqun Feng <boqun.feng@xxxxxxxxx>
> CC: Andrew Hunter <ahh@xxxxxxxxxx>
> CC: Maged Michael <maged.michael@xxxxxxxxx>
> CC: gromer@xxxxxxxxxx
> CC: Avi Kivity <avi@xxxxxxxxxxxx>
> CC: Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx>
> CC: Paul Mackerras <paulus@xxxxxxxxx>
> CC: Michael Ellerman <mpe@xxxxxxxxxxxxxx>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> Tested-by: Dave Watson <davejwatson@xxxxxx>
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index f66488dfdbc9..3b035584272f 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -8621,7 +8621,7 @@ M:	Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
>  M:	"Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>
>  L:	linux-kernel@xxxxxxxxxxxxxxx
>  S:	Supported
> -F:	kernel/membarrier.c
> +F:	kernel/sched/membarrier.c
>  F:	include/uapi/linux/membarrier.h
>
>  MEMORY MANAGEMENT
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index 659ae8094ed5..c8f7d98d8cb9 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -360,6 +360,8 @@ __notrace_funcgraph struct task_struct *__switch_to(struct task_struct *prev,
>  	/*
>  	 * Complete any pending TLB or cache maintenance on this CPU in case
>  	 * the thread migrates to a different CPU.
> +	 * This full barrier is also required by the membarrier system
> +	 * call.
>  	 */
>  	dsb(ish);
>
> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
> index e0b108bd2624..6d47b3249d8a 100644
> --- a/include/uapi/linux/membarrier.h
> +++ b/include/uapi/linux/membarrier.h
> @@ -40,14 +40,33 @@
>   *                          (non-running threads are de facto in such a
>   *                          state). This covers threads from all processes
>   *                          running on the system. This command returns 0.
> + * @MEMBARRIER_CMD_PRIVATE_EXPEDITED:
> + *                          Execute a memory barrier on each running
> + *                          thread belonging to the same process as the current
> + *                          thread. Upon return from system call, the
> + *                          caller thread is ensured that all its running
> + *                          thread siblings have passed through a state
> + *                          where all memory accesses to user-space
> + *                          addresses match program order between entry
> + *                          to and return from the system call
> + *                          (non-running threads are de facto in such a
> + *                          state). This only covers threads from the
> + *                          same process as the caller thread. This
> + *                          command returns 0. The "expedited" commands
> + *                          complete faster than the non-expedited ones,
> + *                          they never block, but have the downside of
> + *                          causing extra overhead.
>   *
>   * Command to be passed to the membarrier system call. The commands need to
>   * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
>   * the value 0.
>   */
>  enum membarrier_cmd {
> -	MEMBARRIER_CMD_QUERY = 0,
> -	MEMBARRIER_CMD_SHARED = (1 << 0),
> +	MEMBARRIER_CMD_QUERY			= 0,
> +	MEMBARRIER_CMD_SHARED			= (1 << 0),
> +	/* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
> +	/* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
> +	MEMBARRIER_CMD_PRIVATE_EXPEDITED	= (1 << 3),
>  };
>
>  #endif /* _UAPI_LINUX_MEMBARRIER_H */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 4cb8e8b23c6e..9c323a6daa46 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -108,7 +108,6 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
>  obj-$(CONFIG_JUMP_LABEL) += jump_label.o
>  obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
>  obj-$(CONFIG_TORTURE_TEST) += torture.o
> -obj-$(CONFIG_MEMBARRIER) += membarrier.o
>
>  obj-$(CONFIG_HAS_IOMEM) += memremap.o
>
> diff --git a/kernel/membarrier.c b/kernel/membarrier.c
> deleted file mode 100644
> index 9f9284f37f8d..000000000000
> --- a/kernel/membarrier.c
> +++ /dev/null
> @@ -1,70 +0,0 @@
> -/*
> - * Copyright (C) 2010, 2015 Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> - *
> - * membarrier system call
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License as published by
> - * the Free Software Foundation; either version 2 of the License, or
> - * (at your option) any later version.
> - *
> - * This program is distributed in the hope that it will be useful,
> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> - * GNU General Public License for more details.
> - */
> -
> -#include <linux/syscalls.h>
> -#include <linux/membarrier.h>
> -#include <linux/tick.h>
> -
> -/*
> - * Bitmask made from a "or" of all commands within enum membarrier_cmd,
> - * except MEMBARRIER_CMD_QUERY.
> - */
> -#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED)
> -
> -/**
> - * sys_membarrier - issue memory barriers on a set of threads
> - * @cmd: Takes command values defined in enum membarrier_cmd.
> - * @flags: Currently needs to be 0. For future extensions.
> - *
> - * If this system call is not implemented, -ENOSYS is returned. If the
> - * command specified does not exist, or if the command argument is invalid,
> - * this system call returns -EINVAL. For a given command, with flags argument
> - * set to 0, this system call is guaranteed to always return the same value
> - * until reboot.
> - *
> - * All memory accesses performed in program order from each targeted thread
> - * is guaranteed to be ordered with respect to sys_membarrier(). If we use
> - * the semantic "barrier()" to represent a compiler barrier forcing memory
> - * accesses to be performed in program order across the barrier, and
> - * smp_mb() to represent explicit memory barriers forcing full memory
> - * ordering across the barrier, we have the following ordering table for
> - * each pair of barrier(), sys_membarrier() and smp_mb():
> - *
> - * The pair ordering is detailed as (O: ordered, X: not ordered):
> - *
> - *                        barrier()   smp_mb()   sys_membarrier()
> - *        barrier()          X           X              O
> - *        smp_mb()           X           O              O
> - *        sys_membarrier()   O           O              O
> - */
> -SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
> -{
> -	/* MEMBARRIER_CMD_SHARED is not compatible with nohz_full. */
> -	if (tick_nohz_full_enabled())
> -		return -ENOSYS;
> -	if (unlikely(flags))
> -		return -EINVAL;
> -	switch (cmd) {
> -	case MEMBARRIER_CMD_QUERY:
> -		return MEMBARRIER_CMD_BITMASK;
> -	case MEMBARRIER_CMD_SHARED:
> -		if (num_online_cpus() > 1)
> -			synchronize_sched();
> -		return 0;
> -	default:
> -		return -EINVAL;
> -	}
> -}
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 53f0164ed362..78f54932ea1d 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -25,3 +25,4 @@ obj-$(CONFIG_SCHED_DEBUG) += debug.o
>  obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
>  obj-$(CONFIG_CPU_FREQ) += cpufreq.o
>  obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
> +obj-$(CONFIG_MEMBARRIER) += membarrier.o
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bfee6ea7db49..f77269c6c2f8 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2640,6 +2640,16 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>  	prev_state = prev->state;
>  	vtime_task_switch(prev);
>  	perf_event_task_sched_in(prev, current);
> +	/*
> +	 * The membarrier system call requires a full memory barrier
> +	 * after storing to rq->curr, before going back to user-space.
> +	 *
> +	 * TODO: This smp_mb__after_unlock_lock can go away if PPC ends
> +	 * up adding a full barrier to switch_mm(), or we should figure
> +	 * out if a smp_mb__after_unlock_lock is really the proper API
> +	 * to use.
> +	 */
> +	smp_mb__after_unlock_lock();
>  	finish_lock_switch(rq, prev);
>  	finish_arch_post_lock_switch();
>
> @@ -3329,6 +3339,21 @@ static void __sched notrace __schedule(bool preempt)
>  	if (likely(prev != next)) {
>  		rq->nr_switches++;
>  		rq->curr = next;
> +		/*
> +		 * The membarrier system call requires each architecture
> +		 * to have a full memory barrier after updating
> +		 * rq->curr, before returning to user-space. For TSO
> +		 * (e.g. x86), the architecture must provide its own
> +		 * barrier in switch_mm(). For weakly ordered machines
> +		 * for which spin_unlock() acts as a full memory
> +		 * barrier, finish_lock_switch() in common code takes
> +		 * care of this barrier. For weakly ordered machines for
> +		 * which spin_unlock() acts as a RELEASE barrier (only
> +		 * arm64 and PowerPC), arm64 has a full barrier in
> +		 * switch_to(), and PowerPC has
> +		 * smp_mb__after_unlock_lock() before
> +		 * finish_lock_switch().
> +		 */
>  		++*switch_count;
>
>  		trace_sched_switch(preempt, prev, next);
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> new file mode 100644
> index 000000000000..a92fddc22747
> --- /dev/null
> +++ b/kernel/sched/membarrier.c
> @@ -0,0 +1,152 @@
> +/*
> + * Copyright (C) 2010-2017 Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> + *
> + * membarrier system call
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/membarrier.h>
> +#include <linux/tick.h>
> +#include <linux/cpumask.h>
> +
> +#include "sched.h"	/* for cpu_rq(). */
> +
> +/*
> + * Bitmask made from a "or" of all commands within enum membarrier_cmd,
> + * except MEMBARRIER_CMD_QUERY.
> + */
> +#define MEMBARRIER_CMD_BITMASK	\
> +	(MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
> +
> +static void ipi_mb(void *info)
> +{
> +	smp_mb();	/* IPIs should be serializing but paranoid. */
> +}
> +
> +static void membarrier_private_expedited(void)
> +{
> +	int cpu;
> +	bool fallback = false;
> +	cpumask_var_t tmpmask;
> +
> +	if (num_online_cpus() == 1)
> +		return;
> +
> +	/*
> +	 * Matches memory barriers around rq->curr modification in
> +	 * scheduler.
> +	 */
> +	smp_mb();	/* system call entry is not a mb. */
> +
> +	/*
> +	 * Expedited membarrier commands guarantee that they won't
> +	 * block, hence the GFP_NOWAIT allocation flag and fallback
> +	 * implementation.
> +	 */
> +	if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> +		/* Fallback for OOM. */
> +		fallback = true;
> +	}
> +
> +	cpus_read_lock();
> +	for_each_online_cpu(cpu) {
> +		struct task_struct *p;
> +
> +		/*
> +		 * Skipping the current CPU is OK even though we can be
> +		 * migrated at any point. The current CPU, at the point
> +		 * where we read raw_smp_processor_id(), is ensured to
> +		 * be in program order with respect to the caller
> +		 * thread. Therefore, we can skip this CPU from the
> +		 * iteration.
> +		 */
> +		if (cpu == raw_smp_processor_id())
> +			continue;
> +		rcu_read_lock();
> +		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> +		if (p && p->mm == current->mm) {
> +			if (!fallback)
> +				__cpumask_set_cpu(cpu, tmpmask);
> +			else
> +				smp_call_function_single(cpu, ipi_mb, NULL, 1);
> +		}
> +		rcu_read_unlock();
> +	}
> +	if (!fallback) {
> +		smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
> +		free_cpumask_var(tmpmask);
> +	}
> +	cpus_read_unlock();
> +
> +	/*
> +	 * Memory barrier on the caller thread _after_ we finished
> +	 * waiting for the last IPI. Matches memory barriers around
> +	 * rq->curr modification in scheduler.
> +	 */
> +	smp_mb();	/* exit from system call is not a mb */
> +}
> +
> +/**
> + * sys_membarrier - issue memory barriers on a set of threads
> + * @cmd: Takes command values defined in enum membarrier_cmd.
> + * @flags: Currently needs to be 0. For future extensions.
> + *
> + * If this system call is not implemented, -ENOSYS is returned. If the
> + * command specified does not exist, is not available on the running
> + * kernel, or if the command argument is invalid, this system call
> + * returns -EINVAL. For a given command, with flags argument set to 0,
> + * this system call is guaranteed to always return the same value until
> + * reboot.
> + *
> + * All memory accesses performed in program order from each targeted thread
> + * are guaranteed to be ordered with respect to sys_membarrier(). If we use
> + * the semantic "barrier()" to represent a compiler barrier forcing memory
> + * accesses to be performed in program order across the barrier, and
> + * smp_mb() to represent explicit memory barriers forcing full memory
> + * ordering across the barrier, we have the following ordering table for
> + * each pair of barrier(), sys_membarrier() and smp_mb():
> + *
> + * The pair ordering is detailed as (O: ordered, X: not ordered):
> + *
> + *                        barrier()   smp_mb()   sys_membarrier()
> + *        barrier()          X           X              O
> + *        smp_mb()           X           O              O
> + *        sys_membarrier()   O           O              O
> + */
> +SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
> +{
> +	if (unlikely(flags))
> +		return -EINVAL;
> +	switch (cmd) {
> +	case MEMBARRIER_CMD_QUERY:
> +	{
> +		int cmd_mask = MEMBARRIER_CMD_BITMASK;
> +
> +		if (tick_nohz_full_enabled())
> +			cmd_mask &= ~MEMBARRIER_CMD_SHARED;
> +		return cmd_mask;
> +	}
> +	case MEMBARRIER_CMD_SHARED:
> +		/* MEMBARRIER_CMD_SHARED is not compatible with nohz_full. */
> +		if (tick_nohz_full_enabled())
> +			return -EINVAL;
> +		if (num_online_cpus() > 1)
> +			synchronize_sched();
> +		return 0;
> +	case MEMBARRIER_CMD_PRIVATE_EXPEDITED:
> +		membarrier_private_expedited();
> +		return 0;
> +	default:
> +		return -EINVAL;
> +	}
> +}

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
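
As an illustration of how the new command is intended to be used from
user-space, a minimal test program might look like the following sketch. It
assumes a kernel with this patch applied and uapi headers that define
MEMBARRIER_CMD_PRIVATE_EXPEDITED; since glibc provides no membarrier()
wrapper, the raw system call is used:

/* membarrier_test.c: minimal exerciser for MEMBARRIER_CMD_PRIVATE_EXPEDITED. */
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

static int membarrier(int cmd, int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
	int mask;

	/* MEMBARRIER_CMD_QUERY returns a bitmask of the supported commands. */
	mask = membarrier(MEMBARRIER_CMD_QUERY, 0);
	if (mask < 0) {
		perror("membarrier");
		return EXIT_FAILURE;
	}
	if (!(mask & MEMBARRIER_CMD_PRIVATE_EXPEDITED)) {
		fprintf(stderr, "private expedited membarrier not supported\n");
		return EXIT_FAILURE;
	}

	/*
	 * Issue a memory barrier on all running threads of this process.
	 * Unlike MEMBARRIER_CMD_SHARED, this never blocks.
	 */
	if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0)) {
		perror("membarrier");
		return EXIT_FAILURE;
	}
	printf("membarrier private expedited: OK\n");
	return EXIT_SUCCESS;
}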