On Thu, Dec 28, 2023 at 07:13:23AM -0800, Yosry Ahmed wrote:
> On Wed, Dec 27, 2023 at 11:31 PM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
> >
> > One of our internal workloads regressed on a newer upstream kernel, and on
> > further investigation it seems like the cause is the always-synchronous
> > rstat flush in count_shadow_nodes() added by commit f82e6bf9bb9b
> > ("mm: memcg: use rstat for non-hierarchical stats"). On further
> > inspection it seems like we don't really need accurate stats in this
> > function, as it was already approximating the amount of appropriate
> > shadow entried to keep for maintaining the refault information. Since
>
> s/entried/entries
>
> > there is already a 2 sec periodic rstat flush, we don't need exact stats
> > here. Let's ratelimit the rstat flush in this code path.
>
> Is the regression observed even with commit 7d7ef0a4686a ("mm: memcg:
> restore subtree stats flushing")? I think the answer is yes based on
> internal discussions, but this really surprises me.

Yes, the regression was observed on the latest mm-stable branch of
Andrew's mm tree.

> Commit f82e6bf9bb9b removed the percpu loop in
> lruvec_page_state_local() and added a flush call. With 7d7ef0a4686a,
> the flush call is only effective if there are pending updates in the
> cgroup subtree exceeding MEMCG_CHARGE_BATCH * num_online_cpus(). IOW,
> we should only be doing work when actually needed, whereas before we
> used to have multiple percpu loops in count_shadow_nodes() regardless
> of pending updates.
>
> It seems like the cgroup subtree is so active that we continuously
> need to flush in count_shadow_nodes()? If that's the case, do we still
> think it's okay not to flush when we know there are pending updates? I
> don't have enough background about the workingset heuristics to judge
> this.

Not all pending updates are necessarily related to the stats being read
here. Also, the read value is further divided by 8 and manipulated more
in do_shrink_slab(), so I don't think we need better than 2 second
accuracy for these stats here.
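
For illustration only (this is a sketch of the kind of change being
discussed, not the actual patch from this thread), ratelimiting the flush
in count_shadow_nodes() would presumably come down to switching the
synchronous flush helper for the ratelimited one, assuming the
post-7d7ef0a4686a API where both helpers take the target memcg and
mem_cgroup_flush_stats_ratelimited() skips the flush when the periodic
flusher has run recently. The surrounding context lines are reproduced
from memory and may differ slightly from the tree in question:

--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ static unsigned long count_shadow_nodes(struct shrinker *shrinker, @@
 	if (sc->memcg) {
 		struct lruvec *lruvec;
 		int i;
 
-		mem_cgroup_flush_stats(sc->memcg);
+		/*
+		 * An exact flush is not needed here: the result is only an
+		 * approximation that do_shrink_slab() scales down further,
+		 * so the ~2 sec periodic flush is accurate enough.
+		 */
+		mem_cgroup_flush_stats_ratelimited(sc->memcg);
 		lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
 		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 			pages += lruvec_page_state_local(lruvec,
 							 NR_LRU_BASE + i);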