On Wed, Sep 4, 2024 at 12:41 PM Jesper Dangaard Brouer <hawk@xxxxxxxxxx> wrote:
>
> This patch reintroduces and generalizes the "stats_flush_ongoing" concept
> to avoid redundant flushes if there is an ongoing flush at the cgroup root
> level, addressing production lock contention issues on the global cgroup
> rstat lock.
>
> At Cloudflare, we observed significant performance degradation due to
> lock contention on the rstat lock, primarily caused by kswapd. The
> specific mem_cgroup_flush_stats() call inlined in shrink_node, which
> takes the rstat lock, is particularly problematic.
>
> On our 12 NUMA node machines, each with a kswapd kthread per NUMA node,
> we noted severe lock contention on the rstat lock, causing 12 CPUs to
> waste cycles spinning every time kswapd runs. Fleet-wide stats
> (/proc/N/schedstat) for kthreads revealed that we are burning an average
> of 20,000 CPU cores fleet-wide on kswapd, primarily due to spinning on
> the rstat lock.
>
> Here's a brief overview of the issue:
> - __alloc_pages_slowpath calls wake_all_kswapds, causing all kswapdN
>   threads to wake up simultaneously.
> - The kswapd thread invokes shrink_node (via balance_pgdat), triggering
>   the cgroup rstat flush operation as part of its work.
> - balance_pgdat() has a NULL value in target_mem_cgroup, causing
>   mem_cgroup_flush_stats() to flush with root_mem_cgroup.
>
> The kernel previously addressed this with a "stats_flush_ongoing"
> concept, which was removed in commit 7d7ef0a4686a ("mm: memcg: restore
> subtree stats flushing"). This patch reintroduces and generalizes the
> concept to apply to all users of cgroup rstat, not just memcg.
>
> In this patch only a root cgroup can become the ongoing flusher, as this
> solves the production issue. Letting other levels become the ongoing
> flusher would cause the root cgroup to contend on the lock again.
>
> Some in-kernel users of the cgroup_rstat_flush() API depend on waiting
> for the flush to complete before continuing. This patch introduces the
> call cgroup_rstat_flush_relaxed() and uses it in those cases that can
> live with slightly inaccurate flushes.
>
> This change significantly reduces lock contention, especially in
> environments with multiple NUMA nodes, thereby improving overall system
> performance.
>
> Fixes: 7d7ef0a4686a ("mm: memcg: restore subtree stats flushing").
> Signed-off-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>

I am honestly not happy with this patch, see below.

> ---
>

Please include a brief delta vs the previous version here to save
reviewers the duplicated effort of figuring it out.

> V9: https://lore.kernel.org/all/172245504313.3147408.12138439169548255896.stgit@firesoul/
> V8: https://lore.kernel.org/all/172139415725.3084888.13770938453137383953.stgit@firesoul
> V7: https://lore.kernel.org/all/172070450139.2992819.13210624094367257881.stgit@firesoul
> V6: https://lore.kernel.org/all/172052399087.2357901.4955042377343593447.stgit@firesoul/
> V5: https://lore.kernel.org/all/171956951930.1897969.8709279863947931285.stgit@firesoul/
> V4: https://lore.kernel.org/all/171952312320.1810550.13209360603489797077.stgit@firesoul/
> V3: https://lore.kernel.org/all/171943668946.1638606.1320095353103578332.stgit@firesoul/
> V2: https://lore.kernel.org/all/171923011608.1500238.3591002573732683639.stgit@firesoul/
> V1: https://lore.kernel.org/all/171898037079.1222367.13467317484793748519.stgit@firesoul/
> RFC: https://lore.kernel.org/all/171895533185.1084853.3033751561302228252.stgit@firesoul/

[..]
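For readers following along, here is a minimal userspace sketch (plain
C11, not the kernel code under review) of the "ongoing flusher" idea the
quoted commit message describes: concurrent root-level flushers race to
become the single ongoing flusher, and the losers skip the flush instead
of contending on the lock. The names (flush_ongoing, do_flush,
try_flush_root) are illustrative only and do not appear in the patch.

/*
 * Sketch of the skip-if-ongoing pattern: only one caller at a time
 * performs the expensive flush; concurrent callers bail out early.
 */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int flush_ongoing;        /* 0 = no root-level flush running */

static void do_flush(void)
{
        /* Stand-in for the expensive walk of the per-CPU updated trees. */
}

/* Returns true if this caller flushed, false if it skipped the flush. */
static bool try_flush_root(void)
{
        int expected = 0;

        /* Another root-level flush is in flight: skip. Callers that can
         * tolerate slightly stale stats simply proceed without flushing. */
        if (atomic_load(&flush_ongoing))
                return false;

        /* Race to become the ongoing flusher. */
        if (!atomic_compare_exchange_strong(&flush_ongoing, &expected, 1))
                return false;

        do_flush();
        atomic_store(&flush_ongoing, 0);
        return true;
}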
> @@ -299,6 +301,67 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
>          spin_unlock_irq(&cgroup_rstat_lock);
>  }
>
> +static inline bool cgroup_is_root(struct cgroup *cgrp)
> +{
> +        return cgroup_parent(cgrp) == NULL;
> +}
> +
> +/**
> + * cgroup_rstat_trylock_flusher - Trylock that checks for an ongoing flusher
> + * @cgrp: target cgroup
> + *
> + * Function return value follows trylock semantics. Returning true when lock is
> + * obtained. Returning false when not locked and it detected flushing can be
> + * skipped as another ongoing flusher is taking care of the flush.
> + *
> + * For callers that depend on flush completing before returning, a strict
> + * option is provided.
> + */
> +static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp, bool strict)
> +{
> +        struct cgroup *ongoing;
> +
> +        if (strict)
> +                goto lock;
> +
> +        /*
> +         * Check if ongoing flusher is already taking care of this. Descendant
> +         * check is necessary due to cgroup v1 supporting multiple roots.
> +         */
> +        ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
> +        if (ongoing && cgroup_is_descendant(cgrp, ongoing))
> +                return false;

Why did we drop the agreed-upon method of waiting until the flushers are
done? This is now a much more intrusive patch which makes all flushers
skip if a root is currently flushing. This causes user-visible problems
and is something that I worked hard to fix. I thought we got good results
with waiting for the ongoing flusher as long as it is a root? What
changed?

You also never addressed my concern here about the stability of 'ongoing'
while we are accessing it, and never responded to my question in v8 about
expanding this to support non-root cgroups once we shift to a mutex.

I don't appreciate the silent yet drastic change made in this version
without addressing concerns raised in previous versions. Please let me
know if I missed something.

> +        /* Grab right to be ongoing flusher */
> +        if (!ongoing && cgroup_is_root(cgrp)) {
> +                struct cgroup *old;
> +
> +                old = cmpxchg(&cgrp_rstat_ongoing_flusher, NULL, cgrp);
> +                if (old) {
> +                        /* Lost race for being ongoing flusher */
> +                        if (cgroup_is_descendant(cgrp, old))
> +                                return false;
> +                }
> +                /* Due to lock yield combined with strict mode record ID */
> +                WRITE_ONCE(cgrp_rstat_ongoing_flusher_ID, current);

I am not sure I understand why we need this, do you mind elaborating?
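For contrast, the "waiting" approach the review refers to can be sketched
in userspace with a condition variable. This is again illustrative only,
under the assumption that "waiting" means callers block until the ongoing
flush completes rather than skipping it; it is not the code previously
discussed on the list, and the names (flush_or_wait, flush_done) are made
up.

/*
 * Sketch of the wait-for-ongoing-flusher alternative: callers that find
 * a flush in progress block until it finishes, so they still observe
 * freshly flushed stats without everyone taking the expensive lock.
 */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  flush_done = PTHREAD_COND_INITIALIZER;
static bool flush_ongoing;

static void do_flush(void)
{
        /* Stand-in for the expensive flush under the rstat lock. */
}

static void flush_or_wait(void)
{
        pthread_mutex_lock(&state_lock);
        if (flush_ongoing) {
                /* Piggyback on the flush in progress: wait until it is
                 * done instead of skipping, so fresh stats are visible. */
                while (flush_ongoing)
                        pthread_cond_wait(&flush_done, &state_lock);
                pthread_mutex_unlock(&state_lock);
                return;
        }
        flush_ongoing = true;
        pthread_mutex_unlock(&state_lock);

        do_flush();

        pthread_mutex_lock(&state_lock);
        flush_ongoing = false;
        pthread_cond_broadcast(&flush_done);
        pthread_mutex_unlock(&state_lock);
}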