Hi Linus, Thanks for having a look. On Fri, Feb 22, 2019 at 01:49:32PM -0800, Linus Torvalds wrote: > On Fri, Feb 22, 2019 at 10:50 AM Will Deacon <will.deacon@xxxxxxx> wrote: > > > > +#ifndef mmiowb_set_pending > > +static inline void mmiowb_set_pending(void) > > +{ > > + __this_cpu_write(__mmiowb_state.mmiowb_pending, 1); > > +} > > +#endif > > + > > +#ifndef mmiowb_spin_lock > > +static inline void mmiowb_spin_lock(void) > > +{ > > + if (__this_cpu_inc_return(__mmiowb_state.nesting_count) == 1) > > + __this_cpu_write(__mmiowb_state.mmiowb_pending, 0); > > +} > > +#endif > > The case we want to go fast is the spin-lock and unlock case, not the > "set pending" case. > > And the way you implemented this, it's exactly the wrong way around. > > So I'd suggest instead doing > > static inline void mmiowb_set_pending(void) > { > __this_cpu_write(__mmiowb_state.mmiowb_pending, > __mmiowb_state.nesting_count); > } > > and > > static inline void mmiowb_spin_lock(void) > { > __this_cpu_inc(__mmiowb_state.nesting_count); > } > > which makes that spin-lock code much simpler and avoids the conditional there. Makes sense; I'll hook that up for the next version. > Then the unlock case could be something like > > static inline void mmiowb_spin_unlock(void) > { > if (unlikely(__this_cpu_read(__mmiowb_state.mmiowb_pending))) { > __this_cpu_write(__mmiowb_state.mmiowb_pending, 0); > mmiowb(); > } > __this_cpu_dec(__mmiowb_state.nesting_count); > } > > or something (xchg is generally much more expensive than read, and the > common case for spinlocks is that nobody did IO inside of it). So I *am* using __this_cpu_xchg() here, which means the architecture can get away with plain old loads and stores (which is what RISC-V does, for example), but I see that's not the case on e.g. x86 so I'll rework using read() and write() because it doesn't hurt. Will