Hmm. This actually seems to make some of the ordering worse. I'm not
seeing a lot of weakening or optimization, but it depends a bit on
what is common and what is not.

On Wed, Jul 21, 2021 at 1:21 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>
> +/*
> + * Increment the current CPU's rcu_data structure's ->dynticks field
> + * with ordering. Return the new value.
> + */
> +static noinstr unsigned long rcu_dynticks_inc(int incby)
> +{
> +        struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> +        int seq;
> +
> +        seq = READ_ONCE(rdp->dynticks) + incby;
> +        smp_store_release(&rdp->dynticks, seq);
> +        smp_mb(); // Fundamental RCU ordering guarantee.
> +        return seq;
> +}

So this is actually likely *more* expensive than the old code was, at
least on x86. The READ_ONCE/smp_store_release are cheap, but then the
smp_mb() is expensive.

The old code did just arch_atomic_inc_return(), which included the
memory barrier.

There *might* be some cache ordering advantage to letting the
READ_ONCE() float upwards, but from a pure barrier standpoint this is
more expensive than what we used to have.

> -        if (atomic_read(&rdp->dynticks) & 0x1)
> +        if (READ_ONCE(rdp->dynticks) & 0x1)
>                  return;
> -        atomic_inc(&rdp->dynticks);
> +        rcu_dynticks_inc(1);

And this one seems to not take advantage of the new rule, so we end
up having two reads, and then that potentially more expensive
sequence.

>  static int rcu_dynticks_snap(struct rcu_data *rdp)
>  {
> -        return atomic_add_return(0, &rdp->dynticks);
> +        smp_mb(); // Fundamental RCU ordering guarantee.
> +        return smp_load_acquire(&rdp->dynticks);
>  }

This is likely cheaper - not because of barriers, but simply because
it avoids dirtying the cacheline.

So which operation do we _care_ about, and do we have numbers for why
this improves anything? Because looking at the patch, it's not
obvious that this is an improvement.

             Linus
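
For reference, a minimal userspace sketch of the cost contrast being
discussed. It uses GCC's __atomic builtins as stand-ins for the
kernel's atomic_inc_return() versus READ_ONCE()/smp_store_release()/
smp_mb(); it is not the kernel code, and the exact instruction
selection depends on compiler and architecture. On x86-64 the first
path typically compiles to a single locked read-modify-write (which
already implies full ordering), while the second is a plain load and
store followed by a separate full fence, which is the expensive part.

/* Build with: gcc -O2 -o dynticks_sketch dynticks_sketch.c */
#include <stdio.h>

static long dynticks;

/* Old-style path: one atomic RMW that already implies a full barrier. */
static long inc_old_style(void)
{
        return __atomic_add_fetch(&dynticks, 1, __ATOMIC_SEQ_CST);
}

/* New-style path: plain load, release store, then an explicit full fence. */
static long inc_new_style(void)
{
        long seq = __atomic_load_n(&dynticks, __ATOMIC_RELAXED) + 1;

        __atomic_store_n(&dynticks, seq, __ATOMIC_RELEASE);
        __atomic_thread_fence(__ATOMIC_SEQ_CST);  /* the costly instruction */
        return seq;
}

int main(void)
{
        printf("%ld %ld\n", inc_old_style(), inc_new_style());
        return 0;
}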