Hello, On Tue, Oct 04, 2022 at 06:17:40PM -0700, Yosry Ahmed wrote: > We have recently ran into a hard lockup on a machine with hundreds of > CPUs and thousands of memcgs during an rstat flush. There have also > been some discussions during LPC between myself, Michal Koutný, and > Shakeel about memcg rstat flushing optimization. This email is a > follow up on that, discussing possible ideas to optimize memcg rstat > flushing. > > Currently, mem_cgroup_flush_stats() is the main interface to flush > memcg stats. It has some internal optimizations that can skip a flush > if there hasn't been significant updates in general. It always flushes > the entire memcg hierarchy, and always invokes flushing using > cgroup_rstat_flush_irqsafe(), which has interrupts disabled and does > not sleep. As you can imagine, with a sufficiently large number of > memcgs and cpus, a call to mem_cgroup_flush_stats() might be slow, or > in an extreme case like the one we ran into, cause a hard lockup > (despite periodically flushing every 4 seconds). How long were the stalls? Given that rstats are usually flushed by its consumers, flushing taking some time might be acceptable but what's really problematic is that the whole thing is done with irq disabled. We can think about other optimizations later too but I think the first thing to do is making the flush code able to pause and resume. ie. flush in batches and re-enable irq / resched between batches. We'd have to pay attention to guaranteeing forward progress. It'd be ideal if we can structure iteration in such a way that resuming doesn't end up nodes which got added after it started flushing. Thanks. -- tejun