Re: Advice on cgroup rstat lock

Jesper Dangaard Brouer <hawk@xxxxxxxxxx> · Tue, 16 Apr 2024 16:22:51 +0200

On 12/04/2024 21.51, Yosry Ahmed wrote:
On Fri, Apr 12, 2024 at 12:26 PM Jesper Dangaard Brouer <hawk@xxxxxxxxxx> wrote:

On 11/04/2024 19.22, Yosry Ahmed wrote:
[..]

How far can we go... could cgroup_rstat_lock be converted to a mutex?

The cgroup_rstat_lock was originally a mutex. It was converted to a
spinlock in commit 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with
a spinlock"). IRQs were disabled to allow calling from atomic context.
Since commit 0a2dc6ac3329 ("cgroup: remove
cgroup_rstat_flush_atomic()"), the rstat API hasn't been called from
atomic context anymore. Theoretically, we could change it back to a
mutex, or stop disabling interrupts. That would require that the API
never be called from atomic context going forward.

I think we should avoid flushing from atomic contexts going forward
anyway tbh. It's just too much work to do with IRQs disabled, and we
observed hard lockups before in worst-case scenarios.

Appreciate the historic commits as documentation for how the code
evolved.  Sounds like we agree that the IRQ-disable can be lifted,
at least between the three of us.

It can be lifted, but whether it should be or not is a different
story. I tried keeping it as a spinlock without disabling IRQs before
and Tejun pointed out possible problems, see below.

IMHO it *MUST* be lifted, as disabling IRQs here is hurting other parts
of the system and actual production systems.

The "offending" IRQ-spin_lock commit (0fa294fb1985) is from 2018.
GitHub noticed the problem in 2019 (via blog[1]), and at Red Hat I
backported[2] patches that (as I now understand) only mitigate the
issue.  Our prod systems are on 6.1 and 6.6, where we still clearly see
the issue occurring.  Also, Daniel's "rtla timerlat" tool for catching
system latency issues has "cgroup_rstat_flush_locked" as the poster
child [3][4].

We have been bitten by the IRQ-spinlock before, so I cannot disagree,
although for us removing atomic flushes and allowing the lock to be
dropped between CPU flushes seems to be good enough (for now).

   [1] https://github.blog/2019-11-21-debugging-network-stalls-on-kubernetes/
   [2] https://bugzilla.redhat.com/show_bug.cgi?id=1795049
   [3] https://bristot.me/linux-scheduling-latency-debug-and-analysis/
   [4] Documentation/tools/rtla/rtla-timerlat-top.rst

I think one problem that was discussed before is that flushing is
exercised from multiple contexts and could have very high concurrency
(e.g. from reclaim when the system is under memory pressure). With a
mutex, the flusher could sleep with the mutex held and block other
threads for a while.

Fair point, so in the first iteration we keep the spin_lock but don't do the
IRQ disable.

I tried doing that before, and Tejun had some objections:
https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@xxxxxxxxxxxxxxx/

My read of that thread is that Tejun would prefer we look into
converting cgroup_rstat_lock into a mutex again, or more aggressively
drop the lock on CPU boundaries. Perhaps we can unconditionally drop
the lock on each CPU boundary, but I am worried that contending the
lock too often may be an issue, which is why I suggested dropping the
lock if there are pending IRQs instead -- but I am not sure how to do
that :)

Like Tejun, I share the concern that keeping this a spinlock can
increase the chance of several CPUs contending on this lock (which is
also a production issue we see).  This is why I suggested to "exit" if
(1) we see the lock has been taken by somebody else, or if (2) stats
were flushed recently.

When you say "exit", do you mean abort the whole thing, or just don't
spin for the lock but wait for the ongoing flush?

I like that we are considering a mutex, because it is not reasonable to
wait by spinning on the lock from remote CPUs when cgroup_rstat_lock is
held for this long (up to 64-128 ms in [prod]).

Prod latency data mentioned earlier:
  [prod] https://lore.kernel.org/all/ac4cf07f-52dd-454f-b897-2a4b3796a4d9@xxxxxxxxxx/

For (2), memcg has a mem_cgroup_flush_stats_ratelimited() system
combined with memcg_vmstats_needs_flush(), which limits the pressure on
the global lock (cgroup_rstat_lock).
*BUT* other users of cgroup_rstat_flush(), like reading io.stat
(blk-cgroup.c) and cpu.stat, don't have such a system to limit pressure
on the global lock. Furthermore, userspace can easily trigger this by
reading those stat files.  And normal userspace stats tools (like
cadvisor, nomad, systemd) spawn threads reading io.stat, cpu.stat and
memory.stat, likely without realizing that on the kernel side they share
the same global lock...

I'm working on a code solution/proposal for "ratelimiting" global lock
access when reading io.stat and cpu.stat.

I personally don't like mem_cgroup_flush_stats_ratelimited() very
much, because it is time-based (unlike memcg_vmstats_needs_flush()),
and a lot of changes can happen in a very short amount of time.
However, it seems like for some workloads it's a necessary evil :/
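
To make the contrast concrete, here is a rough userspace sketch (not
kernel code) of an update-based check in the spirit of
memcg_vmstats_needs_flush(): updates accumulate a magnitude, and a flush
is only considered worthwhile once that magnitude crosses a threshold.
The struct, the function names and the threshold are all made up for
illustration and do not mirror the actual memcg implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical bookkeeping for "how much changed since the last flush". */
struct pending_updates {
	long magnitude;  /* sum of |delta| recorded since the last flush */
	long threshold;  /* flush only once magnitude exceeds this value */
};

/* Record one stat update; positive and negative deltas both count. */
void record_update(struct pending_updates *p, long delta)
{
	p->magnitude += labs(delta);
}

/* Update-based check: independent of how much time has passed. */
bool needs_flush(const struct pending_updates *p)
{
	assert(p->threshold >= 0);
	return p->magnitude > p->threshold;
}

/* Called after a flush has folded the pending updates into the totals. */
void mark_flushed(struct pending_updates *p)
{
	p->magnitude = 0;
}
```

Unlike a time-based limit, this check reacts to a burst of updates
immediately, at the cost of having to track the pending magnitude on
every update.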

I like the combination of the two: mem_cgroup_flush_stats_ratelimited()
and memcg_vmstats_needs_flush().
IMHO the jiffies rate limit of 2*FLUSH_TIME is too high; it looks like 4 sec?

I briefly looked into a global scheme similar to
memcg_vmstats_needs_flush() in core cgroups code, but I gave up
quickly. Different subsystems have different incomparable stats, so we
cannot have a simple magnitude of pending updates on a cgroup-level
that represents all subsystems fairly.

I tried to have per-subsystem callbacks to update the pending stats
and check if flushing is required -- but it got complicated quickly
and performance was bad.

I like the time-based limit because it doesn't require tracking pending
updates.

I'm looking at using a time-based limit, on how often userspace can take
the lock, but in the range of 50 ms to 100 ms.
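
As a rough illustration of such a time-based limit, a userspace sketch
(not the actual kernel proposal) could look like the following. The
struct, the function name and the 50 ms interval are hypothetical; the
caller is assumed to pass in a monotonic timestamp in milliseconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical rate limiter for flushes triggered by userspace reads. */
struct flush_ratelimit {
	uint64_t min_interval_ms;  /* e.g. 50, minimum gap between flushes */
	uint64_t last_flush_ms;    /* timestamp of the last allowed flush */
	bool flushed_once;         /* false until the first flush is allowed */
};

/*
 * Returns true if the caller may take the lock and flush now. Returns
 * false if a flush was allowed less than min_interval_ms ago, in which
 * case the caller would report the slightly stale stats instead.
 */
bool flush_allowed(struct flush_ratelimit *rl, uint64_t now_ms)
{
	assert(now_ms >= rl->last_flush_ms);  /* clock must be monotonic */

	if (rl->flushed_once &&
	    now_ms - rl->last_flush_ms < rl->min_interval_ms)
		return false;

	rl->last_flush_ms = now_ms;
	rl->flushed_once = true;
	return true;
}
```

This needs no tracking of pending updates, which is the appeal, but it
also means a reader inside the interval can miss changes that happened
since the last flush.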

At some point, having different rstat trees for different subsystems
was brought up. I never looked into actually implementing it, but I
suppose if we do that we could have a generic scheme similar to
memcg_vmstats_needs_flush() that can be customized by each subsystem
in a clean performant way? I am not sure.

[..]

I vaguely recall experimenting locally with changing that lock into a
mutex and not liking the results, but I can't remember much more. I
could be misremembering though.

Currently, the lock is dropped in cgroup_rstat_flush_locked() between
CPU iterations if rescheduling is needed or the lock is being
contended (i.e. spin_needbreak() returns true). I had always wondered
if it's possible to introduce a similar primitive for IRQs? We could
also drop the lock (and re-enable IRQs) if IRQs are pending then.

I am not sure if there is a way to check if a hardirq is pending, but we
do have a local_softirq_pending() helper.

local_softirq_pending() might work well for me, as this is our prod
problem: CPU-local pending softirqs are getting starved.

If my understanding is correct, softirqs are usually scheduled by
IRQs, which means that local_softirq_pending() may return false if
there are pending IRQs (that will schedule softirqs). Is this correct?

Yes, a networking hard IRQ will raise a softirq, but software often also
raises softirqs.
I see where you are going with this... the cgroup_rstat_flush_locked()
loop's "play nice" check happens with the IRQ lock held, so you speculate
that the IRQ handler will not be able to raise a softirq, and thus
local_softirq_pending() will not work inside the IRQ lock.

Exactly.

I wonder if it would be okay to just unconditionally drop the lock at
each CPU boundary. Would be interesting to experiment with this. One
disadvantage of the mutex in this case (imo) is that outside of the
percpu spinlock critical section, we don't really need to be holding
the global lock/mutex. So sleeping while holding it is not needed and
only introduces problems. Dropping the spinlock at each boundary seems
like a way to circumvent that.

This sounds interesting, to unconditionally drop the lock at each CPU
boundary.  We should experiment with this.
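
For the sake of discussion, the shape of "drop the lock at each CPU
boundary" can be modelled in userspace like this. It is only a sketch:
the struct, the helper names and NR_CPUS = 4 are made up, a C11
atomic_flag stands in for the spinlock, and IRQ disabling is not
modelled at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_CPUS 4  /* illustrative value only */

/* Toy model: per-CPU pending counts folded into one global total. */
struct rstat_sim {
	atomic_flag lock;               /* stands in for the global lock */
	long percpu_pending[NR_CPUS];   /* updates not yet flushed */
	long total;                     /* flushed total */
	unsigned int lock_acquisitions; /* how often the lock was taken */
};

static void sim_lock(struct rstat_sim *s)
{
	while (atomic_flag_test_and_set_explicit(&s->lock,
						 memory_order_acquire))
		;  /* spin until the current holder releases the lock */
	s->lock_acquisitions++;
}

static void sim_unlock(struct rstat_sim *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Take and release the lock once per CPU instead of once per flush. */
void flush_drop_per_cpu(struct rstat_sim *s)
{
	assert(s != NULL);

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		sim_lock(s);
		s->total += s->percpu_pending[cpu];
		s->percpu_pending[cpu] = 0;
		sim_unlock(s);
		/* Window in which other flushers can get the lock. */
	}
}
```

The hold time per acquisition is bounded by the work for a single CPU,
and the price is NR_CPUS lock acquisitions per flush, which is where
the worry about contending the lock too often comes from.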

If the problems you are observing are mainly on CPUs that are holding
the lock and flushing, I suspect this should help greatly. If the problems
are mainly on CPUs spinning for the lock, I suspect it will still help
redistribute the lock (and IRQ disablement) more often, but not as
much.

In production another problematic (but rarely occurring) issue is when
several CPUs contend on this lock.  Yosry's recent work/patches have
already reduced the chances of this happening (thanks), BUT it still can
and does happen.
A simple solution to this would be to do a spin_trylock() in
cgroup_rstat_flush(), and exit if we cannot get the lock, because we
know someone else will do the work.
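
The idea can be sketched in userspace as follows. This is not kernel
code: a C11 atomic_flag plays the role of spin_trylock(), and the
struct, the function name and the flush counter are hypothetical
stand-ins for the real flush work.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state: one global "lock" plus a count of flushes done. */
struct flush_state {
	atomic_flag in_progress;  /* set while somebody is flushing */
	unsigned long flushes;    /* stands in for the real flush work */
};

/*
 * Try to flush; if somebody else already holds the lock, return false
 * right away instead of spinning, on the assumption that the current
 * holder will do the work.
 */
bool flush_or_skip(struct flush_state *st)
{
	assert(st != NULL);

	if (atomic_flag_test_and_set_explicit(&st->in_progress,
					      memory_order_acquire))
		return false;  /* lock is taken: skip, do not spin */

	st->flushes++;  /* the actual flushing would happen here */

	atomic_flag_clear_explicit(&st->in_progress, memory_order_release);
	return true;
}
```

A caller that gets false back returns without waiting, so the stats it
reads afterwards may not yet include the flush that is still running.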

I am not sure I understand what you mean specifically with the checks
below, but I generally don't like this (as you predicted :) ).

On the memcg side, we used to have similar logic when we used to
always flush the entire tree. This led to flushing being
non-deterministic. You would occasionally get stale stats because of the
contention, which resulted in some inconsistencies (e.g. performing
proactive reclaim successfully then reading the stats that do not
reflect that).

Now that we dropped the logic to always flush the entire tree, it is
even more difficult because concurrent flushes could be in completely
irrelevant subtrees.

If we were to introduce some smart logic to figure out that the
subtree we are trying to flush is already being flushed, I think we
would need to wait for that ongoing flush to complete instead of just
returning (e.g. using completions). But I think such implementations
to find overlapping flushes and wait for them may be too complicated.

We will see if you hate my current code approach ;-)

Just to be clear, if the spinlock was to be converted to a mutex, or
to be dropped at each CPU boundary, do you think such
ratelimiting is still needed to mitigate lock contention -- even if
the IRQs latency problem is fixed?

With a mutex, lock contention will be less obvious, as converting this to
a mutex avoids multiple CPUs spinning while waiting for the lock, but
it doesn't remove the lock contention.

Userspace can easily trigger pressure on the global cgroup_rstat_lock
simply by reading io.stat and cpu.stat files (under /sys/fs/cgroup/).
I think we need a system to mitigate lock contention from userspace
(waiting on code compiling with a proposal).  We see normal userspace
stats tools like cadvisor, nomad (and systemd) trigger this by reading
all the stat files on the system and even spawning parallel threads,
without realizing that on the kernel side they share the same global lock.

You have done a huge effort to mitigate lock contention from memcg,
thank you for that.  It would be sad if userspace reading these stat
files can block memcg.  In production I see shrink_node hitting a
congestion point on this global lock.

--Jesper