On Wed, Oct 5, 2022 at 9:30 AM Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Hello,
>
> On Tue, Oct 04, 2022 at 06:17:40PM -0700, Yosry Ahmed wrote:
> > We have recently run into a hard lockup on a machine with hundreds of
> > CPUs and thousands of memcgs during an rstat flush. There have also
> > been some discussions during LPC between myself, Michal Koutný, and
> > Shakeel about memcg rstat flushing optimization. This email is a
> > follow-up on that, discussing possible ideas to optimize memcg rstat
> > flushing.
> >
> > Currently, mem_cgroup_flush_stats() is the main interface to flush
> > memcg stats. It has some internal optimizations that can skip a flush
> > if there haven't been significant updates in general. It always flushes
> > the entire memcg hierarchy, and always invokes flushing using
> > cgroup_rstat_flush_irqsafe(), which has interrupts disabled and does
> > not sleep. As you can imagine, with a sufficiently large number of
> > memcgs and cpus, a call to mem_cgroup_flush_stats() might be slow, or
> > in an extreme case like the one we ran into, cause a hard lockup
> > (despite periodically flushing every 4 seconds).
>
> How long were the stalls?

I think 10 seconds with interrupts disabled is what it takes to trigger
a hard lockup, right?

> Given that rstats are usually flushed by their
> consumers, flushing taking some time might be acceptable but what's really
> problematic is that the whole thing is done with irq disabled. We can think
> about other optimizations later too but I think the first thing to do is
> making the flush code able to pause and resume. ie. flush in batches and
> re-enable irq / resched between batches. We'd have to pay attention to
> guaranteeing forward progress. It'd be ideal if we can structure iteration
> in such a way that resuming doesn't end up at nodes which got added after it
> started flushing.
IIUC you mean that the caller of cgroup_rstat_flush() could call a
different variant that only flushes a part of the rstat tree and then
returns, with the caller making several such calls interleaved with
re-enabling irqs, right? The flushing code already seems to do this
internally when the non-irqsafe version is used.

I think this might be tricky. In this case the path that caused the
lockup was:

memcg_check_events()->mem_cgroup_threshold()->__mem_cgroup_threshold()
->mem_cgroup_usage()->mem_cgroup_flush_stats()

Interrupts are disabled by the callers of memcg_check_events(), but the
rstat flush call is made much deeper in the call stack, so whoever is
disabling interrupts doesn't have access to pause/resume flushing.
There are also other code paths that used to call
cgroup_rstat_flush_irqsafe() directly before mem_cgroup_flush_stats()
was introduced, like mem_cgroup_wb_stats() [1].

This is why I suggested a selective flushing variant of
cgroup_rstat_flush_irqsafe(), so that flushers that need irqs disabled
have the ability to only flush a subset of the stats and avoid long
stalls where possible.

[1] https://lore.kernel.org/lkml/20211001190040.48086-2-shakeelb@xxxxxxxxxx/

> Thanks.
>
> --
> tejun