Peter Zijlstra wrote:
> On Mon, 2008-12-08 at 18:00 -0500, Theodore Tso wrote:
>> On Mon, Dec 08, 2008 at 11:20:35PM +0100, Peter Zijlstra wrote:
>>> atomic_t is pretty good on all archs, but you get to keep the cacheline
>>> ping-pong.
>>>
>> Stupid question --- if you're worried about cacheline ping-pongs, why
>> aren't each cpu's delta counter cacheline aligned? With a 64-byte
>> cache-line, and a 32-bit counters entry, with less than 16 CPU's we're
>> going to be getting cache ping-pong effects with percpu_counter's,
>> right? Or am I missing something?
>
> sorta - a new per-cpu allocator is in the works, but we do cacheline
> align the per-cpu allocations (or used to), also, the allocations are
> node affine.
>

I did work on a 'lightweight percpu counter', aka percpu_lcounter, for all
metrics that don't need a 64-bit value but only a plain 'long' (network,
nr_files, nr_dentry, nr_inodes, ...):

struct percpu_lcounter {
	atomic_long_t count;
#ifdef CONFIG_SMP
#ifdef CONFIG_HOTPLUG_CPU
	struct list_head list;	/* All percpu_lcounters are on a list */
#endif
	long *counters;
#endif
};

(No more spinlock.)

Then I tried to use atomic_t (or atomic_long_t) for 'counters', but got a
10% slowdown of __percpu_lcounter_add(), even when never hitting the slow
path: atomic_long_add_return() is really expensive, even on a
non-contended cache line.

struct percpu_lcounter {
	atomic_long_t count;
#ifdef CONFIG_SMP
#ifdef CONFIG_HOTPLUG_CPU
	struct list_head list;	/* All percpu_lcounters are on a list */
#endif
	atomic_long_t *counters;
#endif
};

So I believe a percpu_lcounter_sum() that tries to reset all cpu local
counts to 0 would really be too expensive, if it slows down _add() so
much:

long percpu_lcounter_sum(struct percpu_lcounter *fblc)
{
	long acc = 0;
	int cpu;

	for_each_online_cpu(cpu)
		acc += atomic_long_xchg(per_cpu_ptr(fblc->counters, cpu), 0);
	return atomic_long_add_return(acc, &fblc->count);
}

void __percpu_lcounter_add(struct percpu_lcounter *flbc, long amount, s32 batch)
{
	long count;
	atomic_long_t *pcount;

	pcount = per_cpu_ptr(flbc->counters, get_cpu());
	count = atomic_long_add_return(amount, pcount); /* way too expensive !!! */
	if (unlikely(count >= batch || count <= -batch)) {
		atomic_long_add(count, &flbc->count);
		atomic_long_sub(count, pcount);
	}
	put_cpu();
}

Better to just forget about that, let percpu_lcounter_sum() only read the
per-cpu values (a rough sketch of such a read-only sum is appended at the
end of this mail), and let __percpu_lcounter_add() avoid atomic ops in the
fast path:

void __percpu_lcounter_add(struct percpu_lcounter *flbc, long amount, s32 batch)
{
	long count;
	long *pcount;

	pcount = per_cpu_ptr(flbc->counters, get_cpu());
	count = *pcount + amount;
	if (unlikely(count >= batch || count <= -batch)) {
		atomic_long_add(count, &flbc->count);
		count = 0;
	}
	*pcount = count;
	put_cpu();
}
EXPORT_SYMBOL(__percpu_lcounter_add);

Also, with the upcoming NR_CPUS=4096, it may be time to design a
hierarchical percpu_counter, to avoid hitting one shared "fbc->count"
every time a local counter overflows.
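
To illustrate that hierarchical idea, a rough, purely illustrative sketch
(the percpu_hcounter name, the per-node array and the reuse of 'batch' at
the node level are my assumptions, not existing kernel code): each cpu
folds its local delta into a per-node atomic, and only a node-level
overflow touches the single shared count.

/* Illustrative sketch only, not existing kernel code. */
struct percpu_hcounter {
	atomic_long_t count;		/* single shared, global value */
	atomic_long_t *node_counts;	/* one intermediate counter per NUMA node */
	long *counters;			/* per-cpu local deltas */
};

void __percpu_hcounter_add(struct percpu_hcounter *phc, long amount, s32 batch)
{
	int cpu = get_cpu();
	long *pcount = per_cpu_ptr(phc->counters, cpu);
	long count = *pcount + amount;

	if (unlikely(count >= batch || count <= -batch)) {
		atomic_long_t *ncount = &phc->node_counts[cpu_to_node(cpu)];
		long nval = atomic_long_add_return(count, ncount);

		/* only a node-level overflow hits the shared global count */
		if (nval >= batch || nval <= -batch) {
			atomic_long_add(nval, &phc->count);
			atomic_long_sub(nval, ncount);
		}
		count = 0;
	}
	*pcount = count;
	put_cpu();
}

A larger, separate batch at the node level would probably make sense, and
the read side would have to walk both the per-cpu and per-node values;
the sketch only shows where the extra level would sit.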
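
And for completeness, the read-only percpu_lcounter_sum() mentioned above
could look roughly like this (again only a sketch of the idea; the result
is approximate while other cpus keep updating their local deltas):

/* Sketch only: read-only sum, tolerates concurrent non-atomic updates. */
long percpu_lcounter_sum(struct percpu_lcounter *fblc)
{
	long acc = atomic_long_read(&fblc->count);
	int cpu;

	for_each_online_cpu(cpu)
		acc += *per_cpu_ptr(fblc->counters, cpu);
	return acc;
}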