Hello, please pardon delay (and probably bad temper). I'm still sick & slow. On Fri, Apr 22, 2011 at 10:33:00AM +0800, Shaohua Li wrote: > > And, no matter what, that's a separate issue from the this_cpu hot > > path optimizations and should be done separately. So, _please_ update > > this_cpu patch so that it doesn't change the slow path semantics. > > in the original implementation, a updater can change several times too, > it can update the count from -(batch -1) to (batch -1) without holding > the lock. so we always have batch*num_cpus*2 deviate That would be a pathelogical case but, even then, after the change the number becomes much higher as it becomes a function of batch * num_updaters, right? I'll try to re-summarize my concerns as my communications don't seem to be getting through very well these few days (likely my fault). The biggest issue I have with the change is that with the suggested changes, the devaition seen by _sum becomes much less predictable. _sum can't be accurate. It never was and never will be, but the deviations have been quite predictable regardless of @batch. It's dependent only on the number and frequency of concurrent updaters. If concurrent updates aren't very frequent and numerous, the caller is guaranteed to get a result which deviates only by quite small margin. If concurrent updates are very frequent and numerous, the caller natuarally can't expect a very accurate result. However, after the change, especially with high @batch count, the result may deviate significantly even with low frequency concurrent updates. @batch deviations won't happen often but will happen once in a while, which is just nasty and makes the API much less useful and those occasional deviations can cause sporadic erratic behaviors - e.g. filesystems use it for free block accounting. It's actually used for somewhat critical decision making. If it were in the fast path, sure, we might and plan for slower contingencies where accuracy is more important, but we're talking about slow path already - it's visiting each per-cpu area for $DEITY's sake, so the tradeoff doesn't make a lot of sense to me. > if we really worry about _sum deviates too much. can we do something > like this: > percpu_counter_sum > { > again: > sum=0 > old = atomic64_read(&fbc->counter) > for_each_online_cpu() > sum += per cpu counter > new = atomic64_read(&fbc->counter) > if (new - old > batch * num_cpus || old - new > batch * num_cpus) > goto again; > return new + sum; > } > in this way we limited the deviate to number of concurrent updater. This > doesn't make _sum too slow too, because we have the batch * num_cpus > check. I don't really worry about _sum performance. It's a quite slow path and most of the cost is from causing cacheline bounces anyway. That said, I don't see how the above would help the deviation problem. Let's say an updater reset per cpu counter but got preempted before updating the global counter. What differences does it make to check fbc->counter before & after like above? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxxx For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>