On Wed, Sep 4, 2024 at 12:41 PM Jesper Dangaard Brouer <hawk@xxxxxxxxxx> wrote:
>
> This patch reintroduces and generalizes the "stats_flush_ongoing" concept
> to avoid redundant flushes if there is an ongoing flush at the cgroup root
> level, addressing production lock contention issues on the global cgroup
> rstat lock.
>
> At Cloudflare, we observed significant performance degradation due to
> lock contention on the rstat lock, primarily caused by kswapd. The
> specific mem_cgroup_flush_stats() call inlined in shrink_node, which
> takes the rstat lock, is particularly problematic.
>
> On our 12 NUMA node machines, each with a kswapd kthread per NUMA node,
> we noted severe lock contention on the rstat lock, causing 12 CPUs to
> waste cycles spinning every time kswapd runs. Fleet-wide stats
> (/proc/N/schedstat) for kthreads revealed that we are burning an average
> of 20,000 CPU cores fleet-wide on kswapd, primarily due to spinning on
> the rstat lock.
>
> Here's a brief overview of the issue:
> - __alloc_pages_slowpath calls wake_all_kswapds, causing all kswapdN
>   threads to wake up simultaneously.
> - The kswapd thread invokes shrink_node (via balance_pgdat), triggering
>   the cgroup rstat flush operation as part of its work.
> - balance_pgdat() has a NULL value in target_mem_cgroup, causing
>   mem_cgroup_flush_stats() to flush with root_mem_cgroup.
>
> The kernel previously addressed this with a "stats_flush_ongoing"
> concept, which was removed in commit 7d7ef0a4686a ("mm: memcg: restore
> subtree stats flushing"). This patch reintroduces and generalizes the
> concept to apply to all users of cgroup rstat, not just memcg.
>
> In this patch only a root cgroup can become the ongoing flusher, as this
> solves the production issue. Letting other levels become the ongoing
> flusher would cause the root cgroup to contend on the lock again.
>
> Some in-kernel users of the cgroup_rstat_flush() API depend on waiting
> for the flush to complete before continuing. This patch introduces the
> call cgroup_rstat_flush_relaxed() and uses it in those cases that can
> live with slightly inaccurate flushes.
>
> This change significantly reduces lock contention, especially in
> environments with multiple NUMA nodes, thereby improving overall system
> performance.
>
> Fixes: 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing").
> Signed-off-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>

I am honestly not happy with this patch, see below.

> ---
>

Please include a brief delta vs the previous version here to save
reviewers the duplicated effort of figuring it out.

> V9: https://lore.kernel.org/all/172245504313.3147408.12138439169548255896.stgit@firesoul/
> V8: https://lore.kernel.org/all/172139415725.3084888.13770938453137383953.stgit@firesoul
> V7: https://lore.kernel.org/all/172070450139.2992819.13210624094367257881.stgit@firesoul
> V6: https://lore.kernel.org/all/172052399087.2357901.4955042377343593447.stgit@firesoul/
> V5: https://lore.kernel.org/all/171956951930.1897969.8709279863947931285.stgit@firesoul/
> V4: https://lore.kernel.org/all/171952312320.1810550.13209360603489797077.stgit@firesoul/
> V3: https://lore.kernel.org/all/171943668946.1638606.1320095353103578332.stgit@firesoul/
> V2: https://lore.kernel.org/all/171923011608.1500238.3591002573732683639.stgit@firesoul/
> V1: https://lore.kernel.org/all/171898037079.1222367.13467317484793748519.stgit@firesoul/
> RFC: https://lore.kernel.org/all/171895533185.1084853.3033751561302228252.stgit@firesoul/

[..]
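For readers following along, here is a minimal userspace sketch (plain
C11, not the kernel code under review) of the "ongoing flusher" idea the
quoted commit message describes: concurrent root-level flushers race to
become the single ongoing flusher, and the losers skip the flush instead
of contending on the lock. The names (flush_ongoing, do_flush,
try_flush_root) are illustrative only and do not appear in the patch.

/*
 * Sketch of the skip-if-ongoing pattern: only one caller at a time
 * performs the expensive flush; concurrent callers bail out early.
 */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int flush_ongoing;        /* 0 = no root-level flush running */

static void do_flush(void)
{
        /* Stand-in for the expensive walk of the per-CPU updated trees. */
}

/* Returns true if this caller flushed, false if it skipped the flush. */
static bool try_flush_root(void)
{
        int expected = 0;

        /* Another root-level flush is in flight: skip. Callers that can
         * tolerate slightly stale stats simply proceed without flushing. */
        if (atomic_load(&flush_ongoing))
                return false;

        /* Race to become the ongoing flusher. */
        if (!atomic_compare_exchange_strong(&flush_ongoing, &expected, 1))
                return false;

        do_flush();
        atomic_store(&flush_ongoing, 0);
        return true;
}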
> @@ -299,6 +301,67 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
>          spin_unlock_irq(&cgroup_rstat_lock);
>  }
>
> +static inline bool cgroup_is_root(struct cgroup *cgrp)
> +{
> +        return cgroup_parent(cgrp) == NULL;
> +}
> +
> +/**
> + * cgroup_rstat_trylock_flusher - Trylock that checks for an ongoing flusher
> + * @cgrp: target cgroup
> + *
> + * Function return value follows trylock semantics. Returning true when lock is
> + * obtained. Returning false when not locked and it detected flushing can be
> + * skipped as another ongoing flusher is taking care of the flush.
> + *
> + * For callers that depend on flush completing before returning, a strict
> + * option is provided.
> + */
> +static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp, bool strict)
> +{
> +        struct cgroup *ongoing;
> +
> +        if (strict)
> +                goto lock;
> +
> +        /*
> +         * Check if ongoing flusher is already taking care of this. Descendant
> +         * check is necessary due to cgroup v1 supporting multiple roots.
> +         */
> +        ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
> +        if (ongoing && cgroup_is_descendant(cgrp, ongoing))
> +                return false;

Why did we drop the agreed-upon method of waiting until the flushers are
done? This is now a much more intrusive patch which makes all flushers
skip if a root is currently flushing. This causes user-visible problems
and is something that I worked hard to fix. I thought we got good results
with waiting for the ongoing flusher as long as it is a root? What
changed?

You also never addressed my concern here about the stability of 'ongoing'
while we are accessing it, and never responded to my question in v8 about
expanding this to support non-root cgroups once we shift to a mutex.

I don't appreciate the silent yet drastic change made in this version
without addressing concerns raised in previous versions. Please let me
know if I missed something.

> +        /* Grab right to be ongoing flusher */
> +        if (!ongoing && cgroup_is_root(cgrp)) {
> +                struct cgroup *old;
> +
> +                old = cmpxchg(&cgrp_rstat_ongoing_flusher, NULL, cgrp);
> +                if (old) {
> +                        /* Lost race for being ongoing flusher */
> +                        if (cgroup_is_descendant(cgrp, old))
> +                                return false;
> +                }
> +                /* Due to lock yield combined with strict mode record ID */
> +                WRITE_ONCE(cgrp_rstat_ongoing_flusher_ID, current);

I am not sure I understand why we need this, do you mind elaborating?
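For contrast, the "waiting" approach the review refers to can be sketched
in userspace with a condition variable. This is again illustrative only,
under the assumption that "waiting" means callers block until the ongoing
flush completes rather than skipping it; it is not the code previously
discussed on the list, and the names (flush_or_wait, flush_done) are made
up.

/*
 * Sketch of the wait-for-ongoing-flusher alternative: callers that find
 * a flush in progress block until it finishes, so they still observe
 * freshly flushed stats without everyone taking the expensive lock.
 */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  flush_done = PTHREAD_COND_INITIALIZER;
static bool flush_ongoing;

static void do_flush(void)
{
        /* Stand-in for the expensive flush under the rstat lock. */
}

static void flush_or_wait(void)
{
        pthread_mutex_lock(&state_lock);
        if (flush_ongoing) {
                /* Piggyback on the flush in progress: wait until it is
                 * done instead of skipping, so fresh stats are visible. */
                while (flush_ongoing)
                        pthread_cond_wait(&flush_done, &state_lock);
                pthread_mutex_unlock(&state_lock);
                return;
        }
        flush_ongoing = true;
        pthread_mutex_unlock(&state_lock);

        do_flush();

        pthread_mutex_lock(&state_lock);
        flush_ongoing = false;
        pthread_cond_broadcast(&flush_done);
        pthread_mutex_unlock(&state_lock);
}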