[RFC] memcg rstat flushing optimization

Hey everyone,

Sorry for the long email :)

We recently ran into a hard lockup on a machine with hundreds of
CPUs and thousands of memcgs during an rstat flush. There have also
been some discussions during LPC between myself, Michal Koutný, and
Shakeel about memcg rstat flushing optimization. This email is a
follow up on that, discussing possible ideas to optimize memcg rstat
flushing.

Currently, mem_cgroup_flush_stats() is the main interface to flush
memcg stats. It has an internal optimization that can skip a flush
if there haven't been significant updates. It always flushes the
entire memcg hierarchy, and always invokes flushing through
cgroup_rstat_flush_irqsafe(), which runs with interrupts disabled and
does not sleep. As you can imagine, with a sufficiently large number
of memcgs and CPUs, a call to mem_cgroup_flush_stats() can be slow,
or in an extreme case like the one we ran into, cause a hard lockup
(despite the periodic flush every 4 seconds).
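For reference, the skip heuristic can be modeled in userspace roughly
as below. This is a simplified sketch, not the actual kernel code; the
names and the threshold value are illustrative:

```c
/* Userspace model of the current skip heuristic: the update side
 * accumulates a cheap counter, and the flush entry point skips the
 * (expensive) full-hierarchy flush unless enough updates piled up.
 * Names and the threshold are illustrative, not actual kernel code. */

#define NUM_ONLINE_CPUS 4		/* stand-in for num_online_cpus() */

static int stats_flush_threshold;	/* models the atomic update counter */
static int flushes_done;

static void memcg_rstat_updated_model(int weight)
{
	stats_flush_threshold += weight;	/* update side: cheap bookkeeping */
}

/* Returns 1 if a full flush actually ran, 0 if it was skipped. */
static int mem_cgroup_flush_stats_model(void)
{
	if (stats_flush_threshold <= NUM_ONLINE_CPUS)
		return 0;	/* not enough pending updates: skip */
	flushes_done++;		/* the kernel would flush the whole hierarchy here */
	stats_flush_threshold = 0;
	return 1;
}
```

The point being: the skip only helps when updates are globally sparse;
once the counter trips, the flush still walks everything.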

(a) A first step might be to introduce a non-irqsafe version of
mem_cgroup_flush_stats(), and only call the _irqsafe version in places
where we can't sleep. This would exclude some contexts, like the stats
reading context and the periodic flushing context, from possibly
causing a lockup.
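The split in (a) might look roughly like the following kernel-style
pseudocode. mem_cgroup_flush_stats_irqsafe() is a hypothetical name,
and the skip-threshold check is elided:

```c
/* Sleepable: for process contexts (stat readers, periodic work). */
void mem_cgroup_flush_stats(void)
{
	if (/* enough pending updates */)
		cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
}

/* Atomic: only for the call sites that cannot sleep. */
void mem_cgroup_flush_stats_irqsafe(void)
{
	if (/* enough pending updates */)
		cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
}
```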

(b) We can also stop always flushing the entire memcg hierarchy, in
hopes that flushing happens incrementally over subtrees. However,
whole-hierarchy flushing was introduced to reduce lock contention when
multiple contexts try to flush memcg stats concurrently: only one of
them flushes and all the others return immediately (with some
inaccuracy, since the others don't actually wait for the flush to
complete). Flushing subtrees would re-introduce that lock contention.
Maybe we can mitigate this in the rstat code by having hierarchical
locks instead of a global lock, although I imagine that can quickly
get too complicated.

(c) One other thing we can do (similar to the recent blkcg patch
series [1]) is keep track of which stats have been updated. We
currently flush MEMCG_NR_STATS + MEMCG_NR_EVENTS (thanks to Shakeel) +
nodes * NR_VM_NODE_STAT_ITEMS stats. I didn't make the exact
calculation, but I suspect this easily goes over 100. Keeping track of
updated stats might take the form of a percpu bitmask. It would
introduce some overhead on the update and flush sides, but it could
help us skip a lot of up-to-date stats and the cache misses of reading
them. On a few sample machines, I found that each (memcg, cpu) pair
had fewer than 5 actually-updated stats on average.
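A userspace sketch of what the bitmask in (c) could look like, per
(memcg, cpu) pair. Everything here (NR_STATS, struct and function
names) is illustrative, not kernel code:

```c
#include <limits.h>
#include <string.h>

/* Hypothetical model of idea (c): a per-cpu dirty bitmask recording
 * which stat indices were updated since the last flush, so the flush
 * side can skip untouched counters entirely. */

#define NR_STATS 128	/* stand-in for stats + events + node stats */
#define BITS_PER_LONG_ (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_WORDS ((NR_STATS + BITS_PER_LONG_ - 1) / BITS_PER_LONG_)

struct stat_cpu {
	long counters[NR_STATS];
	unsigned long updated[BITMAP_WORDS];	/* dirty bitmask */
};

static void stat_update(struct stat_cpu *sc, int idx, long delta)
{
	sc->counters[idx] += delta;
	sc->updated[idx / BITS_PER_LONG_] |= 1UL << (idx % BITS_PER_LONG_);
}

/* Fold only dirty stats into @total; returns how many were flushed. */
static int stat_flush(struct stat_cpu *sc, long *total)
{
	int flushed = 0;

	for (int i = 0; i < NR_STATS; i++) {
		if (!(sc->updated[i / BITS_PER_LONG_] &
		      (1UL << (i % BITS_PER_LONG_))))
			continue;	/* up-to-date: no read, no cache miss */
		total[i] += sc->counters[i];
		sc->counters[i] = 0;
		flushed++;
	}
	memset(sc->updated, 0, sizeof(sc->updated));
	return flushed;
}
```

With ~5 dirty stats out of 100+ per pair, the loop would touch a small
fraction of the counters (at the cost of one extra store per update).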

(d) Instead of optimizing rstat flushing in general, we can just
mitigate the cases that can actually cause a lockup. After we do (a)
and separate call sites that actually need to disable interrupts, we
can introduce a new selective flush callback (e.g.
cgroup_rstat_flush_opts()). This callback can flush only the stats we
care about (bitmask?) and leave the rstat tree untouched (only
traverse the tree, don't pop the nodes). It might be less than optimal
in cases where the stats we choose to flush are the only ones that are
updated, and the cgroup just remains on the rstat tree for no reason.
However, it effectively addresses the cases that can cause a lockup by
only flushing a small subset of the stats.
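In kernel-style pseudocode, the selective flush in (d) might look like
this; all names here (cgroup_rstat_flush_opts, NR_TRACKED_STATS) are
hypothetical:

```c
/* Sketch of idea (d): fold only the stats set in @stat_mask, and
 * leave the rstat tree itself untouched. */
void cgroup_rstat_flush_opts(struct cgroup *cgrp,
			     const unsigned long *stat_mask)
{
	/* for each CPU, walk the updated tree as cgroup_rstat_flush()
	 * does today, but only fold the requested stats: */
	for_each_set_bit(i, stat_mask, NR_TRACKED_STATS) {
		/* fold the per-cpu delta for stat i into cgrp's totals */
	}
	/* do NOT pop cgrp off the rstat tree: other stats may still
	 * be dirty */
}
```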

(e) If we do both (c) and (d), we can go one step further. We can make
cgroup_rstat_flush_opts() return a boolean indicating whether the
cgroup is completely flushed (i.e. what we asked to flush is
everything that was updated). If true, we can remove the cgroup from
the rstat tree. However, to do this we would need either separate
rstat trees per subsystem, or to keep track of which subsystems have
updates for a cgroup (so that when cgroup_rstat_flush_opts() returns
true we know whether we can remove the cgroup from the tree).
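Combining (c) and (d) as in (e) can be modeled in userspace as below.
The struct, names, and single-word dirty mask are all illustrative;
the rstat tree is reduced to an on/off flag:

```c
#include <stdbool.h>

/* Hypothetical model of idea (e): a selective flush that reports
 * whether the requested mask covered everything dirty; if so, the
 * cgroup can drop off the (modeled) rstat tree. */

#define NR_STATS 8

struct cg_model {
	long pcpu[NR_STATS];	/* pending per-cpu deltas */
	unsigned long dirty;	/* idea (c): which stats were updated */
	long total[NR_STATS];
	bool on_rstat_tree;	/* models membership in the rstat tree */
};

static void cg_update(struct cg_model *cg, int idx, long delta)
{
	cg->pcpu[idx] += delta;
	cg->dirty |= 1UL << idx;
	cg->on_rstat_tree = true;
}

/* Flush only the stats in @mask; return true iff nothing dirty remains. */
static bool cg_flush_opts(struct cg_model *cg, unsigned long mask)
{
	for (int i = 0; i < NR_STATS; i++) {
		if (!(cg->dirty & mask & (1UL << i)))
			continue;
		cg->total[i] += cg->pcpu[i];
		cg->pcpu[i] = 0;
		cg->dirty &= ~(1UL << i);
	}
	if (cg->dirty == 0)
		cg->on_rstat_tree = false;	/* fully flushed: pop it */
	return cg->dirty == 0;
}
```

This also shows the failure mode mentioned above: if the mask misses a
dirty stat, the cgroup lingers on the tree until a wider flush.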

Of course, nothing is free. Most of the solutions above will introduce
overhead somewhere, complexity, or both. We also don't have a standard
benchmark that will tell us for sure whether a change made things
generally better, as the results will vary vastly depending on the
setup, the workloads, etc. Nothing will make everything better for all
use cases. This is just me kicking off a discussion to see what we
can/should do :)

[1] https://lore.kernel.org/lkml/20221004151748.293388-1-longman@xxxxxxxxxx/



