On Mon, Oct 04, 2021 at 11:07:45AM -0700, Shakeel Butt wrote:
> > Sorry for being so slow but can you point to the exact call path which gets
> > slowed down so significantly?
>
> This is the mem_cgroup_flush_stats() inside workingset_refault() in
> mm/workingset.c.

I see. Was looking at a repo which was too old.

> > I'm mostly wondering whether we need some sort
> > of time-batched flushes because even with lock avoidance the flush path
> > really isn't great when called frequently. We can mitigate it further if
> > necessary - e.g. by adding an "updated" bitmap so that the flusher doesn't
> > have to go around touching the cachelines for all the cpus.
>
> For the memcg stats, I already proposed a batched flush at [1].
>
> I actually did perform the same experiment with the proposed patch
> along with [1] and it improves around just 1%. More specifically for
> memcg stats [1] is good enough but that is memcg specific and this
> patch has merits on its own.

So, the current rstat code doesn't pay a lot of attention to optimizing the
read path - the reasoning being that as long as we avoid O(nr_cgroups), the
flush operations aren't frequent enough to be problematic. The use in the
refault path seems to change that balance and it likely is worthwhile to
update rstat accordingly.

As I mentioned above, a next step could be adding a cpumask which tracks
cpus with a populated updated list, which should add pretty small cost to
the writers while making frequent flushes significantly cheaper. What do
you think about that approach? While the proposed patch looks fine, it
kinda bothers me that it's a very partial optimization - ie. if flush
frequency is high enough for this to matter, that for_each_possible_cpu()
scanning loop really isn't appropriate.

Thanks.

--
tejun
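
For illustration, a rough sketch of the cpumask idea mentioned above - all
names here are hypothetical, not the actual rstat code. Writers mark their
CPU when they first queue something on the per-cpu updated list, and the
flusher only walks the marked CPUs instead of every possible CPU:

#include <linux/cpumask.h>

/* hypothetical: one bit per CPU with a non-empty updated list */
static struct cpumask rstat_updated_cpus;

static void sketch_mark_updated(int cpu)
{
	/* test first so the hot path doesn't keep dirtying the shared cacheline */
	if (!cpumask_test_cpu(cpu, &rstat_updated_cpus))
		cpumask_set_cpu(cpu, &rstat_updated_cpus);
}

static void sketch_flush(void)
{
	int cpu;

	/* only visit CPUs that actually have pending updates */
	for_each_cpu(cpu, &rstat_updated_cpus) {
		cpumask_clear_cpu(cpu, &rstat_updated_cpus);
		/* ... flush the per-cpu updated list for this cpu ... */
	}
}

The point of the test-before-set is that the common update path only reads
the shared cacheline; it only writes it on the first update after a flush.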