Re: [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing

Jesper Dangaard Brouer <hawk@xxxxxxxxxx> · Thu, 18 Apr 2024 13:00:30 +0200

On 18/04/2024 04.21, Yosry Ahmed wrote:
On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <hawk@xxxxxxxxxx> wrote:

This patch aims to reduce userspace-triggered pressure on the global
cgroup_rstat_lock by introducing a mechanism to limit how often reading
stat files causes cgroup rstat flushing.

In the memory cgroup subsystem, memcg_vmstats_needs_flush() combined with
mem_cgroup_flush_stats_ratelimited() already limits pressure on the
global lock (cgroup_rstat_lock). As a result, reading memory-related stat
files (such as memory.stat, memory.numa_stat, zswap.current) is already
a less userspace-triggerable issue.

However, other userspace users of cgroup_rstat_flush(), such as when
reading io.stat (blk-cgroup.c) and cpu.stat, lack a similar system to
limit pressure on the global lock. Furthermore, userspace can easily
trigger this issue by reading those stat files.

Typically, normal userspace stats tools (e.g., cadvisor, nomad, systemd)
spawn threads that read io.stat, cpu.stat, and memory.stat (even from the
same cgroup) without realizing that on the kernel side, they share the
same global lock. This limitation also helps prevent malicious userspace
applications from harming the kernel by reading these stat files in a
tight loop.

To address this, the patch introduces cgroup_rstat_flush_ratelimited(),
similar to memcg's mem_cgroup_flush_stats_ratelimited().

Because flushing occurs per cgroup (even though the lock remains global),
a variable named rstat_flush_last_time is introduced to track when a given
cgroup was last flushed. This variable, which contains the jiffies of the
flush, shares properties and a cache line with rstat_flush_next and is
updated simultaneously.

For cpu.stat, we need to acquire the lock (via cgroup_rstat_flush_hold)
because other data is read under the lock, but we skip the expensive
flushing if it occurred recently.

Regarding io.stat, there is an opportunity outside the lock to skip the
flush, but inside the lock, we must recheck to handle races.

Signed-off-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
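
The "skip outside the lock, recheck inside the lock" pattern described
for io.stat can be sketched in userspace C. This is only an illustrative
model, not the patch itself: struct cgroup_model, flush_ratelimited(),
flushed_recently(), FLUSH_WINDOW_MS and the atomic_flag spinlock are all
hypothetical stand-ins, and time is passed in as milliseconds instead of
being read from jiffies. Only the field name rstat_flush_last_time comes
from the patch description.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical window: 50 ms, in milliseconds rather than jiffies. */
#define FLUSH_WINDOW_MS 50UL

/* Stand-in for the global cgroup_rstat_lock (a trivial spinlock here). */
static atomic_flag global_lock = ATOMIC_FLAG_INIT;

/* Stand-in for per-cgroup state; field name follows the patch text. */
struct cgroup_model {
	_Atomic unsigned long rstat_flush_last_time; /* last flush, in ms */
	unsigned long flush_count;                   /* real flushes done */
};

/* True if the cgroup was flushed less than one window ago. */
bool flushed_recently(struct cgroup_model *cg, unsigned long now_ms)
{
	unsigned long last = atomic_load(&cg->rstat_flush_last_time);

	return now_ms - last < FLUSH_WINDOW_MS;
}

/* Returns true if an actual flush was performed. */
bool flush_ratelimited(struct cgroup_model *cg, unsigned long now_ms)
{
	bool did_flush = false;

	/* Cheap check without the lock: skip if the stats are fresh. */
	if (flushed_recently(cg, now_ms))
		return false;

	while (atomic_flag_test_and_set(&global_lock))
		; /* spin until the global lock is free */

	/* Recheck: another reader may have flushed while we waited. */
	if (!flushed_recently(cg, now_ms)) {
		cg->flush_count++; /* the expensive flush would run here */
		atomic_store(&cg->rstat_flush_last_time, now_ms);
		did_flush = true;
	}

	atomic_flag_clear(&global_lock);
	return did_flush;
}
```

In this model, two readers arriving 20 ms apart cause a single flush; the
second reader returns before touching the lock. A cpu.stat-style caller
would instead always take the lock and apply only the inner check.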

As I mentioned in another thread, I really don't like time-based
rate-limiting [1]. Would it be possible to generalize the
magnitude-based rate-limiting instead? Have something like
memcg_vmstats_needs_flush() in the core rstat code?

I've taken a closer look at memcg_vmstats_needs_flush(), and I'm
concerned about the overhead of maintaining the stats that are used as
the filter.

  static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
  {
	return atomic64_read(&vmstats->stats_updates) >
		MEMCG_CHARGE_BATCH * num_online_cpus();
  }

I looked at `vmstats->stats_updates` to see how often it gets updated.
It is updated in memcg_rstat_updated(), which gets inlined into a number
of functions (__mod_memcg_state, __mod_memcg_lruvec_state,
__count_memcg_events), and which also calls cgroup_rstat_updated().
Counting invocations per sec (via funccount):

  10:28:09
  FUNC                                    COUNT
  __mod_memcg_state                      377553
  __count_memcg_events                   393078
  __mod_memcg_lruvec_state              1229673
  cgroup_rstat_updated                  2632389

I'm surprised to see how many times per sec this is getting invoked.
The calls originating from memcg_rstat_updated() (the sum of the three
memcg functions above) add up to 2,000,304 per sec.
(On a 128 CPU core machine with 39% idle CPU-load.)
Maintaining these stats seems excessive...

Then, how often does the filter lower pressure on the lock:

  MEMCG_CHARGE_BATCH(64) * 128 CPU = 8192
  2000304/(64*128) = 244 times per sec (every ~4ms)
  (assuming memcg_rstat_updated val=1)
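
The same arithmetic as a small C sketch, in case someone wants to plug in
their own funccount numbers. The helper names (flush_threshold,
crossings_per_sec, print_filter_rate) are made up for illustration; the
batch size of 64 and the val=1 assumption are the ones stated above.

```c
#include <stdio.h>

#define MEMCG_CHARGE_BATCH 64UL /* value quoted above */

/* Number of pending updates above which the filter requests a flush. */
unsigned long flush_threshold(unsigned long online_cpus)
{
	return MEMCG_CHARGE_BATCH * online_cpus;
}

/* How often per second the threshold is crossed, assuming val=1. */
double crossings_per_sec(unsigned long updates_per_sec,
			 unsigned long online_cpus)
{
	return (double)updates_per_sec /
	       (double)flush_threshold(online_cpus);
}

/* Print the threshold, the crossing rate and the resulting interval. */
void print_filter_rate(unsigned long updates_per_sec,
		       unsigned long online_cpus)
{
	double rate = crossings_per_sec(updates_per_sec, online_cpus);

	printf("threshold=%lu rate=%.1f/sec interval=%.2f ms\n",
	       flush_threshold(online_cpus), rate, 1000.0 / rate);
}
```

Calling print_filter_rate(2000304, 128) corresponds to the numbers from
this machine: a threshold of 8192 and roughly 244 crossings per sec.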

Also, why do we keep the memcg time rate-limiting with this patch? Is
it because we use a much larger window there (2s)? Having two layers
of time-based rate-limiting is not ideal imo.

I'm also not a fan of having two layers of time-based rate-limiting, but
they do operate at different time scales *and* are not active at the same
time with the current patch. If you look at the details, I excluded
memcg from using this, as I commented "memcg have own ratelimit layer"
(in do_flush_stats).

I would prefer removing the memcg time rate-limiting and using this more
granular 50ms (20 times/sec) limit for memcg as well, and I was planning
to do that in a followup patchset.  The 50ms (20 times/sec) limit applies
per cgroup in the system, so the total "scales"/increases with the number
of cgroups, but that is still better than unbounded lock acquisitions
per sec.
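
To make the scaling explicit, here is a rough upper-bound calculation as
a C sketch. The function name max_flushes_per_sec is hypothetical, and
the bound assumes every cgroup is read often enough to hit its limit.

```c
/*
 * Upper bound on flushes per second when each cgroup is limited to
 * one flush per window_ms. Returns 0 for a zero-length window, since
 * the limit is then undefined.
 */
unsigned long max_flushes_per_sec(unsigned long window_ms,
				  unsigned long nr_cgroups)
{
	if (window_ms == 0)
		return 0;

	return (1000UL / window_ms) * nr_cgroups;
}
```

With a 50ms window this gives 20 per sec for one cgroup and 20,000 per
sec for 1000 cgroups, so the bound grows linearly with the cgroup count.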

--Jesper

[1] https://lore.kernel.org/lkml/CAJD7tkYnSRwJTpXxSnGgo-i3-OdD7cdT-e3_S_yf7dSknPoRKw@xxxxxxxxxxxxxx/

sudo ./bcc/tools/funccount -Ti 1 -d 10 \
  '__mod_memcg_state|__mod_memcg_lruvec_state|__count_memcg_events|cgroup_rstat_updated'