On Mon, Jun 17, 2024 at 05:31:21PM GMT, Jesper Dangaard Brouer wrote:
>
> On 16/06/2024 02.28, Yosry Ahmed wrote:
> > On Sat, Jun 15, 2024 at 1:13 AM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
> > >
> > > Meta production is seeing a large number of stalls in the memcg stats
> > > flush from the memcg reclaim code path. At the moment, this specific
> > > callsite is doing a synchronous memcg stats flush. The rstat flush is
> > > an expensive and time consuming operation, so concurrent reclaimers
> > > will busy-wait on the lock, potentially for a long time. This issue
> > > is not unique to Meta and has been observed by Cloudflare [1] as
> > > well. For the Cloudflare case, the stalls were due to contention
> > > between kswapd threads running on their 8 NUMA node machines, which
> > > does not make sense as the rstat flush is global and a flush from one
> > > kswapd thread should be sufficient for all. Simply replace the
> > > synchronous flush with the ratelimited one.
> > >
>
> Like Yosry, I don't agree that simply using the ratelimited flush here
> is the right solution; at least the other options need to be
> investigated first.

I added more detail in my reply to Yosry on why using the ratelimited
flush for this specific case is fine.

[...]
> >
> > I think you already know my opinion about this one :) I don't like it
> > at all, and I will explain why below. I know it may be a necessary
> > evil, but I would like us to make sure there is no other option before
> > going forward with this.
> >
>
> I'm signing up to solving this somehow, as this is a real prod issue.
>
> An easy way to solve the kswapd issue would be to reintroduce the
> "stats_flush_ongoing" concept, which was reverted in 7d7ef0a4686a ("mm:
> memcg: restore subtree stats flushing") (Author: Yosry Ahmed) and
> introduced in 3cd9992b9302 ("memcg: replace stats_flush_lock with an
> atomic") (Author: Yosry Ahmed).

Skipping the flush when "stats_flush_ongoing" is set has been there from
the start.

> The concept is: if there is an ongoing rstat flush, this time limited
> to the root cgroup, then don't perform the flush. We can only do this
> for the root cgroup tree, as flushing can be done for subtrees, but
> kswapd is always for the root tree, so it is good enough to solve the
> kswapd thundering herd problem. We might want to generalize this beyond
> memcg.
>

No objection from me to this idea of skipping the root memcg flush.
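Just to make sure we are talking about the same thing, below is a rough,
untested sketch of what the root-only skip could look like (the function
name and the atomic flag name are made up here; the old code from
3cd9992b9302 did effectively this for every flush):

	/* Hypothetical sketch, not against any particular tree. */
	static atomic_t root_flush_ongoing = ATOMIC_INIT(0);

	static void flush_memcg_stats_sketch(struct mem_cgroup *memcg)
	{
		if (mem_cgroup_is_root(memcg)) {
			/*
			 * A root flush covers every subtree, so a concurrent
			 * root flusher can skip instead of waiting on the
			 * global rstat lock (the kswapd thundering herd case).
			 */
			if (atomic_read(&root_flush_ongoing) ||
			    atomic_xchg(&root_flush_ongoing, 1))
				return;

			cgroup_rstat_flush(root_mem_cgroup->css.cgroup);

			atomic_set(&root_flush_ongoing, 0);
			return;
		}

		/* Subtree flushes keep the current behavior. */
		cgroup_rstat_flush(memcg->css.cgroup);
	}

Since kswapd always flushes the root tree, concurrent kswapd threads
would simply skip instead of serializing on the rstat lock, though the
skipping thread does proceed with stats that the in-flight flusher has
not finished updating yet.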
[...]
> >
> > - With the added thresholding code, a flush is only done if there is a
> >   significant number of pending updates in the relevant subtree.
> >   Choosing the ratelimited approach is intentionally ignoring a
> >   significant change in stats (although arguably it could be irrelevant
> >   stats).
> >
>
> My production observations are that the thresholding code isn't
> limiting the flushing in practice.
>

Here we need more production data. I remember you mentioned MEMCG_KMEM
being used for most of the updates. Is it possible to get the top 5 (or
10) most updated stats for your production environment?

> > - Reclaim code is an iterative process, so not updating the stats on
> >   every retry is very counterintuitive. We are retrying reclaim using
> >   the same stats and heuristics used by a previous iteration,
> >   essentially dismissing the effects of those previous iterations.
> >
> > - Indeterministic behavior like this one is very difficult to debug if
> >   it causes problems. The missing updates in the last 2s (or whatever
> >   period) could be of any magnitude. We may be ignoring GBs of
> >   free/allocated memory. What's worse is, if it causes any problems,
> >   tracing it back to this flush will be extremely difficult.
> >
>
> The 2 sec seems like a long period to me.
>
> > What can we do?
> >
> > - Try to make more fundamental improvements to the flushing code (for
> >   memcgs or cgroups in general). The per-memcg flushing thresholding
> >   is an example of this. For example, if flushing is taking too long
> >   because we are flushing all subsystems, it may make sense to have
> >   separate rstat trees for separate subsystems.
> >
> > One other thing we can try is to add a mutex in the memcg flushing
> > path. I initially had this in my subtree flushing series [1], but I
> > dropped it as we thought it's not very useful.
>
> I'm running an experimental kernel with the rstat lock converted to a
> mutex on a number of production servers, and we have not observed any
> regressions. The kswapd thundering herd problem also happens on these
> machines, but as these are sleepable background threads, it is fine to
> sleep on the mutex.
>

Sorry, but a global mutex which can be taken by userspace applications
and is needed by the node controller (to read stats) is a no from me. On
multi-tenant systems, global locks causing priority inversion are a real
issue.

[...]
> My pipe dream is that the kernel can avoid the cost of maintaining the
> cgroup threshold stats for flushing, and instead rely on a dynamic
> time-based threshold (in the ms area) that has no fast-path overhead :-P
>

Please do expand on what you mean by a dynamic time-based threshold.
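For example, is it roughly the following (a hypothetical, untested
sketch; the names are made up and it ignores the race between concurrent
callers), i.e. a pure time check in the flush path with no per-update
counting in the fast path?

	/* Hypothetical sketch only, not a concrete proposal. */
	static unsigned long last_flush_jiffies;

	static bool time_based_flush_needed(unsigned int period_ms)
	{
		unsigned long last = READ_ONCE(last_flush_jiffies);

		if (time_before(jiffies, last + msecs_to_jiffies(period_ms)))
			return false;

		WRITE_ONCE(last_flush_jiffies, jiffies);
		return true;
	}

If so, how would the period be chosen dynamically?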