On Wed, Oct 07, 2015 at 04:00:42PM -0400, Waiman Long wrote:
> On 10/06/2015 05:30 PM, Dave Chinner wrote:
> >>> /*
> >>>  * Aggregate the per-cpu counter magazines back into the global
> >>>  * counter. This avoids the need for repeated compare operations to
> >>>  * run the slow path when the majority of the counter value is held
> >>>  * in the per-cpu magazines. Folding them back into the global
> >>>  * counter means we will continue to hit the fast
> >>>  * percpu_counter_read() path until the counter value falls
> >>>  * completely within the comparison limit passed to
> >>>  * __percpu_counter_compare().
> >>>  */
> >>> static s64 percpu_counter_aggregate(struct percpu_counter *fbc)
> >>> {
> >>> 	s64 ret;
> >>> 	int cpu;
> >>> 	unsigned long flags;
> >>>
> >>> 	raw_spin_lock_irqsave(&fbc->lock, flags);
> >>> 	ret = fbc->count;
> >>> 	for_each_online_cpu(cpu) {
> >>> 		s32 count = __this_cpu_read(*fbc->counters);
> >>> 		ret += count;
> >>> 		__this_cpu_sub(*fbc->counters, count);
> >>> 	}
> >>> 	fbc->count = ret;
> >>> 	raw_spin_unlock_irqrestore(&fbc->lock, flags);
> >>> 	return ret;
> >>> }
> >>
> >> I don't think that will work as some other CPUs may change the
> >> percpu counters values between percpu_counter_aggregate() and
> >> __percpu_counter_compare(). To be safe, the precise counter has to
> >> be computed whenever the comparison value difference is less than
> >> nr_cpus * batch size.
> >
> > Well, yes. Why do you think the above function does the same
> > function as percpu_counter_sum()? So that the percpu_counter_sum()
> > call *inside* __percpu_counter_compare() can be replaced by this
> > call. i.e.
> >
> > 			return -1;
> > 	}
> > 	/* Need to use precise count */
> > -	count = percpu_counter_sum(fbc);
> > +	count = percpu_counter_aggregate(fbc);
> > 	if (count > rhs)
> > 		return 1;
> > 	else if (count < rhs)
> >
> > Please think about what I'm saying rather than dismissing it without
> > first understanding my suggestions.
>
> I understood what you were saying.
> However, the per-cpu counter
> isn't protected by the spinlock. Reading it is OK, but writing may
> cause a race if that counter is modified by a CPU other than its
> owning CPU.

<sigh>

You're still trying to pick apart the code without considering what
we need to achieve. We don't need the code to be bullet proof to
test whether this hypothesis is correct or not - we just need
something that is "near enough" to give us the data point that tells
us where we should focus our efforts.

If optimising the counter like above does not reduce the overhead,
then we may have to change XFS. If it does reduce the overhead, then
the XFS code remains unchanged and we focus on optimising the
counter code. But we *need to test the hypothesis first*.

As it is, the update race you pointed out is easy to solve with
__this_cpu_cmpxchg rather than __this_cpu_sub (similar to
mod_state() in the MM percpu counter stats code, perhaps).

So please test the above change and stop quibbling over details that
just don't matter until we know which way we need to go.

> The slow performance of percpu_counter_sum() is due to its need to
> access n different (likely cold) cachelines where n is the number of
> CPUs in the system. So the larger the system, the more problematic
> it will be. Welcome to today's lecture titled "Per-cpu Counters 101".

:/

Have a think about who you are lecturing(*). Please don't treat me
as you would a university intern who's just learning about
multithreaded programming, because a) it's disrespectful, and b)
you'll jump to the wrong conclusions because you may not immediately
understand what I'm saying or why I'm asking you to do something.

I always start by assuming participants understand the topic being
discussed. I can quickly tell if a person has deep knowledge of the
topic from their responses, and the above "explain the basics"
response is a key indicator.
If you assume that the other person understands the topic as well or
better than you do, then you won't make this mistake and, better
yet, we might all learn something in the ensuing discussion.

Cheers,

Dave.

(*) XFS had custom per-cpu counters because there was no generic
infrastructure in Linux for this ten years ago:

commit 8d280b98cfe3c0b69c37d355218975c1c0279bb0
Author: David Chinner <dgc@xxxxxxx>
Date:   Tue Mar 14 13:13:09 2006 +1100

    [XFS] On machines with more than 8 cpus, when running parallel I/O
    threads, the incore superblock lock becomes the limiting factor for
    buffered write throughput. Make the contended fields in the incore
    superblock use per-cpu counters so that there is no global lock to
    limit scalability.

    SGI-PV: 946630
    SGI-Modid: xfs-linux-melb:xfs-kern:25106a

    Signed-off-by: David Chinner <dgc@xxxxxxx>
    Signed-off-by: Nathan Scott <nathans@xxxxxxx>

--
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs