----- On Aug 1, 2017, at 9:43 AM, Andy Lutomirski luto@xxxxxxxxxx wrote:

> On Mon, Jul 31, 2017 at 9:03 PM, Paul E. McKenney
> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
>> On Tue, Aug 01, 2017 at 12:04:05AM +0000, Mathieu Desnoyers wrote:
>>> ----- On Jul 31, 2017, at 12:13 PM, Paul E. McKenney paulmck@xxxxxxxxxxxxxxxxxx
>>> wrote:
>>> >
>> Thanx, Paul
>>
>> ------------------------------------------------------------------------
>>
>> commit fde19879b6bd1abc0c1d4d5f945efed61bf7eb8c
>> Author: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
>> Date:   Fri Jul 28 16:40:40 2017 -0400
>>
>>     membarrier: Expedited private command
>>
>>     Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
>>     from all runqueues for which current thread's mm is the same as the
>>     thread calling sys_membarrier. It executes faster than the non-expedited
>>     variant (no blocking). It also works on NOHZ_FULL configurations.
>>
>>     Scheduler-wise, it requires a memory barrier before and after context
>>     switching between processes (which have different mm). The memory
>>     barrier before context switch is already present. For the barrier after
>>     context switch:
>>
>>     * Our TSO archs can do RELEASE without being a full barrier. Look at
>>       x86 spin_unlock() being a regular STORE for example. But for those
>>       archs, all atomics imply smp_mb and all of them have atomic ops in
>>       switch_mm() for mm_cpumask().
>
> I think that, on x86, context switches, even without mm changes, must
> at least flush the store buffer (maybe SFENCE is okay) to avoid
> visible inconsistency due to store-buffer forwarding.
>
> Anyway, can you document whatever property you require with a comment
> in switch_mm() or wherever you're finding that property so that future
> arch changes don't break it?

As I asked Paul in my reply to his proposed manual merge, we should
indeed have a comment in switch_mm() stating something like this, just
before the line invoking cpumask_set_cpu():

/*
 * The full memory barrier implied by mm_cpumask update operations
 * is required by the membarrier system call.
 */

What we want to order here is:

  prev userspace memory accesses
  schedule
    <full mb>  (it's already there)
    [A] update to rq->curr changing the rq->curr->mm value
    <full mb>  (provided by mm_cpumask updates in switch_mm on x86)
  [B] next userspace memory accesses

with respect to:

  userspace memory accesses
  sys_membarrier
    <full mb>
    [C] iterate on each cpu's rq->curr, compare their "mm" to current->mm,
        IPI each CPU that matches
    <full mb>
  [D] userspace memory accesses

[A] pairs with [D], and [B] pairs with [C].
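To make the intended placement concrete, here is a rough sketch. The
surrounding function is illustrative only (hypothetical name, heavily
simplified body, not the actual x86 switch_mm() implementation); the
point is simply that the comment sits immediately before the
cpumask_set_cpu() call:

static void switch_mm_sketch(struct mm_struct *prev, struct mm_struct *next,
                             struct task_struct *tsk)
{
        unsigned int cpu = smp_processor_id();

        /* ... arch-specific mm switch work elided ... */

        /*
         * The full memory barrier implied by mm_cpumask update operations
         * is required by the membarrier system call.
         */
        cpumask_set_cpu(cpu, mm_cpumask(next));

        /* ... load the new page tables and flush the TLB as needed ... */
}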
>
>> +static void membarrier_private_expedited(void)
>> +{
>> +        int cpu;
>> +        bool fallback = false;
>> +        cpumask_var_t tmpmask;
>> +
>> +        if (num_online_cpus() == 1)
>> +                return;
>> +
>> +        /*
>> +         * Matches memory barriers around rq->curr modification in
>> +         * scheduler.
>> +         */
>> +        smp_mb();        /* system call entry is not a mb. */
>> +
>> +        /*
>> +         * Expedited membarrier commands guarantee that they won't
>> +         * block, hence the GFP_NOWAIT allocation flag and fallback
>> +         * implementation.
>> +         */
>> +        if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
>> +                /* Fallback for OOM. */
>> +                fallback = true;
>> +        }
>> +
>> +        cpus_read_lock();
>> +        for_each_online_cpu(cpu) {
>> +                struct task_struct *p;
>> +
>> +                /*
>> +                 * Skipping the current CPU is OK even though we can be
>> +                 * migrated at any point. The current CPU, at the point
>> +                 * where we read raw_smp_processor_id(), is ensured to
>> +                 * be in program order with respect to the caller
>> +                 * thread. Therefore, we can skip this CPU from the
>> +                 * iteration.
>> +                 */
>> +                if (cpu == raw_smp_processor_id())
>> +                        continue;
>> +                rcu_read_lock();
>> +                p = task_rcu_dereference(&cpu_rq(cpu)->curr);
>> +                if (p && p->mm == current->mm) {
>
> I'm a bit surprised you're iterating all CPUs instead of just CPUs in
> mm_cpumask().

I see two reasons for this. The first is that architectures like ARM64
don't even bother populating mm_cpumask. The second is that I don't
think all architectures ensure that updates to mm_cpumask imply full
memory barriers. We would therefore need to revisit each architecture's
switch_mm to make sure the mm_cpumask bit-set operations either imply a
full memory barrier or are followed by an explicit one, before we could
use this bitmask as an optimization.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
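As a usage illustration, a minimal userspace caller of the new command
could look like the sketch below. This is a hypothetical test program,
not part of the patch; it assumes uapi headers that already define
MEMBARRIER_CMD_PRIVATE_EXPEDITED, and note that later kernels may also
require a prior MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED registration.

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

/* glibc does not provide a membarrier() wrapper, so invoke the raw syscall. */
static int membarrier(int cmd, int flags)
{
        return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
        /* Expedited barrier across all threads of the calling process. */
        if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) != 0) {
                perror("membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED)");
                return 1;
        }
        return 0;
}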