On Tue, Jun 18, 2024 at 05:53:02PM GMT, Jesper Dangaard Brouer wrote: > > [...] > > > > - With the added thresholding code, a flush is only done if there is a > > > > significant number of pending updates in the relevant subtree. > > > > Choosing the ratelimited approach is intentionally ignoring a > > > > significant change in stats (although arguably it could be irrelevant > > > > stats). > > > > > > > > > > My production observations are that the thresholding code isn't limiting > > > the flushing in practice. > > > > > > > Here we need more production data. I remember you mentioned MEMCG_KMEM > > being used for most of the updates. Is it possible to get top 5 (or 10) > > most updated stats for your production environment? > > > > Do you have a method for obtaining these stats? > > Last time I used eBPF + bpftrace to extract these stats. The high rate > these updates occur, it caused production overload situations that SRE > noticed. So, I have to be careful when producing these stats for you. > > Also could you be more specific what code lines you want stats for? > I am looking for idx and val in __mod_memcg_state, __mod_memcg_lruvec_state and __count_memcg_events. Can you share your bpftrace trace (I remember you shared it, sorry for asking again) and I will reach out to bpf experts to see if we can further reduce the cost. > > > > > > > > - Reclaim code is an iterative process, so not updating the stats on > > > > every retry is very counterintuitive. We are retrying reclaim using > > > > the same stats and heuristics used by a previous iteration, > > > > essentially dismissing the effects of those previous iterations. > > > > > > > > - Indeterministic behavior like this one is very difficult to debug if > > > > it causes problems. The missing updates in the last 2s (or whatever > > > > period) could be of any magnitude. We may be ignoring GBs of > > > > free/allocated memory. What's worse is, if it causes any problems, > > > > tracing it back to this flush will be extremely difficult. > > > > > > > > > > The 2 sec seems like a long period for me. > > > > > > > What can we do? > > > > > > > > - Try to make more fundamental improvements to the flushing code (for > > > > memcgs or cgroups in general). The per-memcg flushing thresholding is > > > > an example of this. For example, if flushing is taking too long > > > > because we are flushing all subsystems, it may make sense to have > > > > separate rstat trees for separate subsystems. > > > > > > > > One other thing we can try is add a mutex in the memcg flushing path. > > > > I had initially had this in my subtree flushing series [1], but I > > > > dropped it as we thought it's not very useful. > > > > > > I'm running an experimental kernel with rstat lock converted to mutex on > > > a number of production servers, and we have not observed any regressions. > > > The kswapd thundering herd problem also happen on these machines, but as > > > these are sleep-able background threads, it is fine to sleep on the mutex. > > > > > > > Sorry but a global mutex which can be taken by userspace applications > > and is needed by node controller (to read stats) is a no from me. On a > > multi-tenant systems, global locks causing priority inversion is a real > > issue. > > > > The situation we have *today* with a global IRQ-disabling spin_lock is > precisely causing a priority-inversion situation always by design. > Userspace applications (reading stat file) and kswapd background > processes are currently getting higher priority than both hardware and > software interrupts. This is causing actual production issues, which is > why I'm working on this. > > I do understand that changing this to a global mutex creates the > theoretical *possibility* for priority-inversion between processes with Ah sorry for giving the impression that I am talking about theoretical issue, no, we (old me at Google) have seen global lock causing priority inversions and cauing the high priority node controller stuck for 10s of seconds to several minutes in Google production. > different configured priorities. IMHO this is better than always taking > the "big" bottom-half-lock [LWN]. I still want to lower the potential > priority-inversion issue with the mutex lock, by (1) working on lowering > the pressure on the lock (e.g. exit if flush is ongoing on root, and > e.g. add time limit on how often flush can run on sub-trees), and (2) we > will collect production metrics for lock contention and hold time (with > appropriate alerts). Oh I am open to mutex change along with the efforts to reduce the chances of priority inversion, just not the mutex change only. I am open to trying out new things as some (or all) might not work and we may have to go back to the older situation. > > > [LWN] https://lwn.net/SubscriberLink/978189/5f50cab8478fac45/ > > > > > > > [...] > > > > > > My pipe dream is that kernel can avoiding the cost of maintain the > > > cgroup threshold stats for flushing, and instead rely on a dynamic time > > > based threshold (in ms area) that have no fast-path overhead :-P > > > > > > > Please do expand on what you mean by dynamic time based threshold. > > I proposed a fixed 50 ms flush rate limiting in [2]. But I don't want > this to be a static value as it will vary on different configs and > hardware. I propose making this dynamic via measuring the time it takes > to flush the cgroup root. This makes sense in a queuing theory context, > because this corresponds to the (longest) service time, and theory say > "a queue is formed when customers arrive faster than they can get > served". Thus, this should lower the pressure/contention on the lock, > while still allowing frequent flushing. Hope it makes sense. > Thanks for the explanation. > --Jesper > > > [2] https://lore.kernel.org/all/171328990014.3930751.10674097155895405137.stgit@firesoul/