On Fri, Sep 29, 2017 at 02:27:57AM +1000, Nicholas Piggin wrote:
> The biggest power boxes are more tightly coupled than those big
> SGI systems, but even so just plodding along taking and releasing
> locks in turn would be fine on those SGI ones as well really. Not DoS
> level. This is not a single mega hot cache line or lock that is
> bouncing over the entire machine, but one process grabbing a line and
> lock from each of 1000 CPUs.
>
> Slight disturbance sure, but each individual CPU will see it as 1/1000th
> of a disturbance, most of the cost will be concentrated in the syscall
> caller.

But once the:

	while (1)
		sys_membarrier()

thread has all those (lock) lines in M state locally, it will become
very hard for the remote CPUs to claim them back, because it's
constantly touching them. Sure it will touch a 1000 other lines before
it's back to this one, but if they're all local that's fairly quick.

But you're right, your big machines have far smaller NUMA factors.

> > Bouncing that lock across the machine is *painful*, I have vague
> > memories of cases where the lock ping-pong was most the time spend.
> >
> > But only Power needs this, all the other architectures are fine with the
> > lockless approach for MEMBAR_EXPEDITED_PRIVATE.
>
> Yes, we can add an iterator function that power can override in a few
> lines. Less arch specific code than this proposal.

A semi related issue; I suppose we can do an arch upcall to flush_tlb_mm
and reset the mm_cpumask when we change cpuset groups.
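
For reference, the "while (1) sys_membarrier()" caller discussed above
could look roughly like the following from userspace. This is only a
sketch under a few assumptions: the helper names (sk_membarrier,
hammer_membarrier) are made up for illustration, the command values are
copied from the uapi <linux/membarrier.h> of kernels that ship the
private expedited command, and the loop is bounded rather than infinite
so it can be run safely.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Command values as found in the kernel's uapi <linux/membarrier.h>.
 * Defined locally (with a hypothetical SK_ prefix) so the sketch still
 * builds against older installed headers.
 */
enum {
	SK_MEMBARRIER_CMD_QUERY                      = 0,
	SK_MEMBARRIER_CMD_PRIVATE_EXPEDITED          = 1 << 3,
	SK_MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED = 1 << 4,
};

/* Thin wrapper; glibc of that era has no membarrier() function. */
long sk_membarrier(int cmd)
{
#ifdef __NR_membarrier
	return syscall(__NR_membarrier, cmd, 0);
#else
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * Issue up to 'iterations' private expedited barriers back to back.
 * Returns the number of calls that succeeded, or -1 if the running
 * kernel does not offer the command (or registration fails).
 */
int hammer_membarrier(int iterations)
{
	long mask = sk_membarrier(SK_MEMBARRIER_CMD_QUERY);
	int done = 0;
	int i;

	if (mask < 0 || !(mask & SK_MEMBARRIER_CMD_PRIVATE_EXPEDITED))
		return -1;

	/* Kernels that advertise a registration command require it first. */
	if ((mask & SK_MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED) &&
	    sk_membarrier(SK_MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED) < 0)
		return -1;

	/* Bounded stand-in for the while (1) loop. */
	for (i = 0; i < iterations; i++) {
		if (sk_membarrier(SK_MEMBARRIER_CMD_PRIVATE_EXPEDITED) == 0)
			done++;
	}
	return done;
}
```

Calling hammer_membarrier(1000000) from one thread of a process whose
other threads are spread over many CPUs approximates the scenario being
argued about; whether it actually hurts the remote CPUs depends on how
the kernel side visits them, which is exactly the open question here.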