On 5/20/24 9:14 AM, Mateusz Guzik wrote:
> This was "percpu_counter: reimplement _add_batch with __this_cpu_cmpxchg".
>
> I chatted with vbabka a little bit and he pointed me at mod_zone_state,
> which does the same thing I needed except dodges preemption -- turns out
> cmpxchg with a gs-prefixed argument is safe here.
>
> ================ cut here ================
>
> Interrupt disable/enable trips are quite expensive on x86-64 compared to
> a mere cmpxchg (note: no lock prefix!) and percpu counters are used
> quite often.
>
> With this change I get a bump of 1% ops/s for negative path lookups,
> plugged into will-it-scale:
>
> void testcase(unsigned long long *iterations, unsigned long nr)
> {
> 	while (1) {
> 		int fd = open("/tmp/nonexistent", O_RDONLY);
> 		assert(fd == -1);
>
> 		(*iterations)++;
> 	}
> }
>
> The win would be higher if it was not for other slowdowns, but one has
> to start somewhere.
>
> v2:
> - dodge preemption
> - use this_cpu_try_cmpxchg
> - keep the old variant depending on CONFIG_HAVE_CMPXCHG_LOCAL
>
> Signed-off-by: Mateusz Guzik <mjguzik@xxxxxxxxx>

Acked-by: Vlastimil Babka <vbabka@xxxxxxx>

I tried a stupid microbenchmark just doing percpu_counter_inc() in a loop
and this cut the time by almost 50%.

As we discussed, it should also be possible to make the fastpath inlined as
the next step, to avoid the function calls that are stupid expensive with
cpu mitigations.

> ---
>  lib/percpu_counter.c | 44 +++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 39 insertions(+), 5 deletions(-)
>
> diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
> index 44dd133594d4..80ec2ffc981a 100644
> --- a/lib/percpu_counter.c
> +++ b/lib/percpu_counter.c
> @@ -73,17 +73,50 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
>  EXPORT_SYMBOL(percpu_counter_set);
>
>  /*
> - * local_irq_save() is needed to make the function irq safe:
> - * - The slow path would be ok as protected by an irq-safe spinlock.
> - * - this_cpu_add would be ok as it is irq-safe by definition.
> - * But:
> - * The decision slow path/fast path and the actual update must be atomic, too.
> + * Add to a counter while respecting batch size.
> + *
> + * There are 2 implementations, both dealing with the following problem:
> + *
> + * The decision slow path/fast path and the actual update must be atomic.
>   * Otherwise a call in process context could check the current values and
>   * decide that the fast path can be used. If now an interrupt occurs before
>   * the this_cpu_add(), and the interrupt updates this_cpu(*fbc->counters),
>   * then the this_cpu_add() that is executed after the interrupt has completed
>   * can produce values larger than "batch" or even overflows.
>   */
> +#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
> +/*
> + * Safety against interrupts is achieved in 2 ways:
> + * 1. the fast path uses local cmpxchg (note: no lock prefix)
> + * 2. the slow path operates with interrupts disabled
> + */
> +void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
> +{
> +	s64 count;
> +	unsigned long flags;
> +
> +	count = this_cpu_read(*fbc->counters);
> +	do {
> +		if (unlikely(abs(count + amount) >= batch)) {
> +			raw_spin_lock_irqsave(&fbc->lock, flags);
> +			/*
> +			 * Note: by now might have migrated to another CPU or
> +			 * the value might have changed.
> +			 */
> +			count = __this_cpu_read(*fbc->counters);
> +			fbc->count += count + amount;
> +			__this_cpu_sub(*fbc->counters, count);
> +			raw_spin_unlock_irqrestore(&fbc->lock, flags);
> +			return;
> +		}
> +	} while (!this_cpu_try_cmpxchg(*fbc->counters, &count, count + amount));
> +}
> +#else
> +/*
> + * local_irq_save() is used to make the function irq safe:
> + * - The slow path would be ok as protected by an irq-safe spinlock.
> + * - this_cpu_add would be ok as it is irq-safe by definition.
> + */
>  void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
>  {
>  	s64 count;
> @@ -101,6 +134,7 @@ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
>  	}
>  	local_irq_restore(flags);
>  }
> +#endif
>  EXPORT_SYMBOL(percpu_counter_add_batch);
>
>  /*
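
For anyone who wants to poke at the fast path/slow path split outside the
kernel, here is a rough userspace model of the cmpxchg variant quoted above.
It is only a sketch under stated assumptions: a C11 thread-local atomic
stands in for the per-CPU counter, an atomic_flag spinlock stands in for the
irq-safe raw spinlock, and the names batch_counter, batch_counter_add and
batch_counter_local are made up for illustration and are not kernel API. It
says nothing about the performance of the real thing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct percpu_counter. */
struct batch_counter {
	atomic_flag lock;	/* models the spinlock guarding count */
	int64_t count;		/* shared total, updated on the slow path only */
};

/* Assumption: one thread-local slot models this CPU's counter. */
static _Thread_local _Atomic int64_t local_delta;

/* Read this thread's not-yet-folded delta. */
int64_t batch_counter_local(void)
{
	return atomic_load_explicit(&local_delta, memory_order_relaxed);
}

void batch_counter_add(struct batch_counter *c, int64_t amount, int32_t batch)
{
	int64_t old = atomic_load_explicit(&local_delta, memory_order_relaxed);

	assert(batch > 0);
	do {
		if (llabs(old + amount) >= batch) {
			/* Slow path: fold the local delta into the total. */
			while (atomic_flag_test_and_set_explicit(&c->lock,
						memory_order_acquire))
				;	/* spin until the lock is free */
			/* Re-read under the lock; the value may have changed. */
			old = atomic_exchange_explicit(&local_delta, 0,
						memory_order_relaxed);
			c->count += old + amount;
			atomic_flag_clear_explicit(&c->lock,
						memory_order_release);
			return;
		}
		/*
		 * Fast path: publish old + amount only if nobody changed the
		 * slot since it was read. On failure 'old' is refreshed and
		 * the batch check above runs again.
		 */
	} while (!atomic_compare_exchange_weak_explicit(&local_delta, &old,
				old + amount,
				memory_order_relaxed, memory_order_relaxed));
}
```

With a batch of 4, ten calls adding 1 take the slow path on the 4th and 8th
call, so the shared total ends up at 8 with 2 still sitting in the local
slot. The point of the retry loop is that the batch check and the update are
decided against the same observed value, which is the property the comment
in the patch asks for.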