On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <hawk@xxxxxxxxxx> wrote:
>
> This patch aims to reduce userspace-triggered pressure on the global
> cgroup_rstat_lock by introducing a mechanism to limit how often reading
> stat files causes cgroup rstat flushing.
>
> In the memory cgroup subsystem, memcg_vmstats_needs_flush() combined with
> mem_cgroup_flush_stats_ratelimited() already limits pressure on the
> global lock (cgroup_rstat_lock). As a result, reading memory-related stat
> files (such as memory.stat, memory.numa_stat, zswap.current) is already
> less of a userspace-triggerable issue.
>
> However, other userspace users of cgroup_rstat_flush(), such as when
> reading io.stat (blk-cgroup.c) and cpu.stat, lack a similar mechanism to
> limit pressure on the global lock. Furthermore, userspace can easily
> trigger this issue by reading those stat files.
>
> Typically, normal userspace stats tools (e.g. cadvisor, nomad, systemd)
> spawn threads that read io.stat, cpu.stat, and memory.stat (even from the
> same cgroup) without realizing that on the kernel side they share the
> same global lock. This rate limiting also helps prevent malicious
> userspace applications from harming the kernel by reading these stat
> files in a tight loop.
>
> To address this, the patch introduces cgroup_rstat_flush_ratelimited(),
> similar to memcg's mem_cgroup_flush_stats_ratelimited().
>
> Flushing occurs per cgroup (even though the lock remains global); a
> variable named rstat_flush_last_time is introduced to track when a given
> cgroup was last flushed. This variable, which contains the jiffies of the
> flush, shares properties and a cache line with rstat_flush_next and is
> updated simultaneously.
>
> For cpu.stat, we need to acquire the lock (via cgroup_rstat_flush_hold)
> because other data is read under the lock, but we skip the expensive
> flushing if it occurred recently.
>
> Regarding io.stat, there is an opportunity outside the lock to skip the
> flush, but inside the lock, we must recheck to handle races.
>
> Signed-off-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>

As I mentioned in another thread, I really don't like time-based
rate-limiting [1]. Would it be possible to generalize the magnitude-based
rate-limiting instead? Have something like memcg_vmstats_needs_flush() in
the core rstat code?

Also, why do we keep the memcg time rate-limiting with this patch? Is it
because we use a much larger window there (2s)? Having two layers of
time-based rate-limiting is not ideal imo.

[1] https://lore.kernel.org/lkml/CAJD7tkYnSRwJTpXxSnGgo-i3-OdD7cdT-e3_S_yf7dSknPoRKw@xxxxxxxxxxxxxx/
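
To be a bit more concrete about what I have in mind, here is a very rough
sketch. The names below (rstat_pending_updates, CGROUP_RSTAT_FLUSH_THRESHOLD,
the helpers) are hypothetical and not from the patch; they are loosely
modeled on memcg_rstat_updated() / memcg_vmstats_needs_flush():

/*
 * Rough sketch only -- rstat_pending_updates and
 * CGROUP_RSTAT_FLUSH_THRESHOLD are made-up names, modeled on
 * memcg_rstat_updated() / memcg_vmstats_needs_flush().
 */

/* Called on the update side: accumulate the magnitude of stat updates. */
static void cgroup_rstat_account_updated(struct cgroup *cgrp, int abs_delta)
{
	atomic64_add(abs_delta, &cgrp->rstat_pending_updates);
}

/* Called on the flush paths: skip the flush when too little has changed. */
static bool cgroup_rstat_needs_flush(struct cgroup *cgrp)
{
	return atomic64_read(&cgrp->rstat_pending_updates) >
	       CGROUP_RSTAT_FLUSH_THRESHOLD * num_online_cpus();
}

(with the counter reset to 0 after a successful flush). The advantage is
that a flush is only skipped when there is genuinely little to flush,
rather than for an arbitrary time window during which readers get stale
stats.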