On 01/07/2019 09:04 PM, Dave Chinner wrote:
> On Mon, Jan 07, 2019 at 05:41:39PM -0500, Waiman Long wrote:
>> On 01/07/2019 05:32 PM, Dave Chinner wrote:
>>> On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
>>>> As newer systems have more and more IRQs and CPUs available in their
>>>> system, the performance of reading /proc/stat frequently is getting
>>>> worse and worse.
>>> Because the "roll-your-own" per-cpu counter implementation has been
>>> optimised for the lowest possible addition overhead on the premise that
>>> summing the counters is rare and isn't a performance issue. This
>>> patchset is a direct indication that this "summing is rare and can
>>> be slow" premise is now invalid.
>>>
>>> We have percpu counter infrastructure that trades off a small amount
>>> of addition overhead for zero-cost reading of the counter value.
>>> i.e. why not just convert this whole mess to percpu_counters and
>>> then just use percpu_counter_read_positive()? Then we just don't
>>> care how often userspace reads the /proc file because there is no
>>> summing involved at all...
>>>
>>> Cheers,
>>>
>>> Dave.
>> Yes, percpu_counter_read_positive() is cheap. However, you still need to
>> pay the price somewhere. In the case of percpu_counter, the update is
>> more expensive.
> Ummm, that's exactly what I just said. It's a percpu counter that
> solves the "sum is expensive and frequent" problem, just like you
> are encountering here. I do not need basic scalability algorithms
> explained to me.

What I am trying to say is that "sum is expensive and frequent" is only
true for a very small percentage of applications; it is not true for most
of them. I am hesitant to add latency to the interrupt path that would
affect all applications.

>> I would say the percentage of applications that will hit this problem is
>> small. But for them, this problem has some significant performance overhead.

> Well, duh!
> What I was suggesting is that you change the per-cpu counter
> implementation to the /generic infrastructure/ that solves this
> problem, and then determine if the extra update overhead is at all
> measurable. If you can't measure any difference in update overhead,
> then slapping complexity on the existing counter to attempt to
> mitigate the summing overhead is the wrong solution.
>
> Indeed, it may be that you need to use a custom batch scaling curve
> for the generic per-cpu counter infrastructure to mitigate the
> update overhead, but the fact is we already have generic
> infrastructure that solves your problem and so the solution should
> be "use the generic infrastructure" until it can be proven not to
> work.
>
> i.e. prove the generic infrastructure is not fit for purpose and
> cannot be improved sufficiently to work for this use case before
> implementing a complex, one-off snowflake counter implementation...

I see your point. I like the deferred summation approach that I am
currently using. If I have to modify the current per-cpu counter
implementation to support that, and probably also add counter grouping
support to amortize the overhead, it could be a major undertaking.
This is not a high priority item for me at the moment, so I may have
to wait until I have some spare time.

Thanks,
Longman