On Mon, Jun 05, 2017 at 03:09:50PM -0700, Paul E. McKenney wrote:
> There would be a slowdown if 1) fast this_cpu_inc is not available and
> cannot be implemented (this usually means that atomic_inc has implicit
> memory barriers),

I don't get this. How is per-cpu crud related to being strongly ordered?

this_cpu_ has 3 forms:

  x86:          single instruction
  arm64,s390:   preempt_disable()+atomic_op
  generic:      local_irq_save()+normal_op

Only s390 is TSO, arm64 is very much a weak arch.

> and 2) local_irq_save/restore is slower than disabling
> preemption. The main architecture with these constraints is s390, which
> however is already paying the price in __srcu_read_unlock and has not
> complained.

IIRC only PPC (and hopefully soon x86) has a local_irq_save() that is as
fast as preempt_disable().

> A valid optimization on s390 would be to skip the smp_mb;
> AIUI, this_cpu_inc implies a memory barrier (!) due to its implementation.

You mean the s390 this_cpu_inc() in specific, right? Because
this_cpu_inc() in general does not imply any such thing.
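
For illustration only, the three this_cpu_ forms listed above can be
modelled in user space roughly as below. This is a sketch under loud
assumptions, not kernel code: every model_* helper and inc_* function is a
hypothetical stand-in invented here, the "single instruction" form is
approximated by a plain increment, and the atomic op is deliberately
memory_order_relaxed to reflect the point that nothing in the generic
this_cpu_inc() contract implies a barrier. Whether a particular
architecture's instruction happens to be ordered is a separate matter that
this model does not capture.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for kernel state; NOT the real primitives. */
static int model_preempt_count;	/* models preempt-disable nesting */
static int model_irqs_off;	/* models the local interrupt-disable flag */

static void model_preempt_disable(void) { model_preempt_count++; }

static void model_preempt_enable(void)
{
	assert(model_preempt_count > 0);	/* enable must pair with disable */
	model_preempt_count--;
}

static int model_local_irq_save(void)
{
	int flags = model_irqs_off;	/* remember the previous state */

	model_irqs_off = 1;
	return flags;
}

static void model_local_irq_restore(int flags) { model_irqs_off = flags; }

/* x86-style form: one instruction in the kernel; a plain increment here. */
static void inc_single_insn(long *ctr)
{
	(*ctr)++;
}

/* arm64/s390-style form: preempt_disable() + atomic op.
 * Relaxed on purpose: no ordering is requested or implied. */
static void inc_preempt_atomic(atomic_long *ctr)
{
	model_preempt_disable();
	atomic_fetch_add_explicit(ctr, 1, memory_order_relaxed);
	model_preempt_enable();
}

/* Generic form: local_irq_save() + normal (non-atomic) op. */
static void inc_irqsave_plain(long *ctr)
{
	int flags = model_local_irq_save();

	(*ctr)++;
	model_local_irq_restore(flags);
}
```

All three variants increment the counter by one and leave the modelled
preempt/irq state as they found it; they differ only in what protects the
read-modify-write, which is where the cost difference discussed above
comes from.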