On Tue, Jun 06, 2017 at 12:53:43PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 05, 2017 at 03:09:50PM -0700, Paul E. McKenney wrote:
> > There would be a slowdown if 1) fast this_cpu_inc is not available and
> > cannot be implemented (this usually means that atomic_inc has implicit
> > memory barriers),
>
> I don't get this.
>
> How is per-cpu crud related to being strongly ordered?
>
> this_cpu_ has 3 forms:
>
>   x86:         single instruction
>   arm64,s390:  preempt_disable()+atomic_op
>   generic:     local_irq_save()+normal_op
>
> Only s390 is TSO, arm64 is very much a weak arch.
>
> > and 2) local_irq_save/restore is slower than disabling
> > preemption.  The main architecture with these constraints is s390, which
> > however is already paying the price in __srcu_read_unlock and has not
> > complained.
>
> IIRC only PPC (and hopefully soon x86) has a local_irq_save() that is as
> fast as preempt_disable().
>
> > A valid optimization on s390 would be to skip the smp_mb;
> > AIUI, this_cpu_inc implies a memory barrier (!) due to its implementation.
>
> You mean the s390 this_cpu_inc() in specific, right? Because
> this_cpu_inc() in general does not imply any such thing.

More generally, yes, the commit log needs some more help, good catch,
thank you!

Does the code itself also need more help?

							Thanx, Paul