On Tue, Apr 16, 2024 at 04:22:51PM +0200, Jesper Dangaard Brouer wrote:

Sorry for the late response. I see there are patches posted as well, which
I will take a look at, but let me put some things in perspective.

> > I personally don't like mem_cgroup_flush_stats_ratelimited() very
> > much, because it is time-based (unlike memcg_vmstats_needs_flush()),
> > and a lot of changes can happen in a very short amount of time.
> > However, it seems like for some workloads it's a necessary evil :/

Other than obj_cgroup_may_zswap(), there is no other place that really
needs very accurate stats. IMO we should actually make the ratelimited
version the default for all the places. Stats will always be out of sync
for some time window even with a non-ratelimited flush, and I don't see
any place where a 2-second-old stat would be an issue.

> I like the combination of the two, mem_cgroup_flush_stats_ratelimited()
> and memcg_vmstats_needs_flush().
>
> IMHO the jiffies rate limit 2*FLUSH_TIME is too high, looks like 4 sec?

4 sec is the worst case, and I don't think anyone has seen or reported
that they are seeing a 4 sec delayed flush; if it is happening, it seems
like no one cares.

> > I briefly looked into a global scheme similar to
> > memcg_vmstats_needs_flush() in core cgroups code, but I gave up
> > quickly. Different subsystems have different incomparable stats, so we
> > cannot have a simple magnitude of pending updates on a cgroup level
> > that represents all subsystems fairly.
> >
> > I tried to have per-subsystem callbacks to update the pending stats
> > and check if flushing is required -- but it got complicated quickly
> > and performance was bad.

> I like the time-based limit because it doesn't require tracking pending
> updates.
>
> I'm looking at using a time-based limit on how often userspace can take
> the lock, but in the area of 50 ms to 100 ms.
Sounds good to me; you might just need to check that obj_cgroup_may_zswap()
is not getting delayed or seeing stale stats.

> With a mutex, lock contention will be less obvious, as converting this to
> a mutex avoids multiple CPUs spinning while waiting for the lock, but
> it doesn't remove the lock contention.

I don't like global sleepable locks, as those are a source of priority
inversion issues on highly utilized multi-tenant systems, but I still need
to see how you are handling that.

> Userspace can easily trigger pressure on the global cgroup_rstat_lock
> via simply reading io.stat and cpu.stat files (under /sys/fs/cgroup/).
> I think we need a system to mitigate lock contention from userspace
> (waiting on code compiling with a proposal). We see normal userspace
> stats tools like cadvisor, nomad (and systemd) trigger this by reading
> all the stat files on the system, even spawning parallel threads,
> without realizing that kernel-side they share the same global lock.
>
> You have done a huge effort to mitigate lock contention from memcg,
> thank you for that. It would be sad if userspace reading these stat
> files can block memcg. On production I see shrink_node having a
> congestion point on this global lock.

Seems like another instance where we should use the ratelimited version
of the flush function.