Re: [PATCH v2] memcg: use ratelimited stats flush in the reclaim

On Tue, Aug 13, 2024 at 2:54 PM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
>
> Meta production is seeing a large number of stalls in the memcg stats
> flush from the memcg reclaim code path. At the moment, this specific
> callsite does a synchronous memcg stats flush. The rstat flush is an
> expensive and time-consuming operation, so concurrent reclaimers
> busy-wait on the lock, potentially for a long time. This issue is not
> unique to Meta and has been observed by Cloudflare [1] as well. In the
> Cloudflare case, the stalls were due to contention between kswapd
> threads running on their 8 NUMA node machines, which does not make
> sense because the rstat flush is global and a flush from one kswapd
> thread should be sufficient for all. Simply replace the synchronous
> flush with the ratelimited one.
>
> One may raise a concern about potentially using stats that are up to
> 2 seconds stale for heuristics like the desirable inactive:active ratio
> and preferring inactive file pages over anon pages, but these specific
> heuristics do not require very precise stats and are also ignored under
> severe memory pressure.
>
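
For readers following along, here is a minimal userspace sketch of the
general idea behind a ratelimited flush: skip the expensive flush while
the last one is recent enough. All names here (stats_flusher, flush_now,
flush_ratelimited, max_stale_ms) are hypothetical and this is not the
kernel's implementation. As far as I understand, the kernel relies on a
periodic flusher and the ratelimited helper only flushes when that
flusher has fallen behind; the sketch collapses this into a single
timestamp check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a stats flusher; not a kernel structure. */
struct stats_flusher {
	uint64_t last_flush_ms;  /* time of the most recent flush */
	uint64_t max_stale_ms;   /* staleness callers are willing to tolerate */
	unsigned long flushes;   /* number of expensive flushes actually run */
};

/* Stand-in for the expensive, lock-protected synchronous flush. */
void flush_now(struct stats_flusher *f, uint64_t now_ms)
{
	f->flushes++;
	f->last_flush_ms = now_ms;
}

/*
 * Flush only if the stats are older than the tolerated staleness.
 * Returns true if a flush was performed, false if it was skipped.
 */
bool flush_ratelimited(struct stats_flusher *f, uint64_t now_ms)
{
	if (now_ms - f->last_flush_ms <= f->max_stale_ms)
		return false;
	flush_now(f, now_ms);
	return true;
}
```

With max_stale_ms set to 2000, any number of concurrent callers arriving
within 2 seconds of the last flush return immediately instead of queueing
up behind the lock.
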
> More specifically for this code path, the stats are needed for two
> specific heuristics:
>
> 1. Deactivate LRUs
> 2. Cache trim mode
>
> The deactivate LRUs heuristic is to maintain a desirable inactive:active
> ratio of the LRUs. The specific stats needed are WORKINGSET_ACTIVATE*
> and the hierarchical LRU size. WORKINGSET_ACTIVATE* is needed to check
> whether there has been a refault since the last snapshot, and the LRU
> size is needed for the desirable ratio between inactive and active LRUs.
> See the table below on how the desirable ratio is calculated.
>
> /* total     target    max
>  * memory    ratio     inactive
>  * -------------------------------------
>  *   10MB       1         5MB
>  *  100MB       1        50MB
>  *    1GB       3       250MB
>  *   10GB      10       0.9GB
>  *  100GB      31         3GB
>  *    1TB     101        10GB
>  *   10TB     320        32GB
>  */
>
> The desirable ratio only changes at the boundary of 1 GiB, 10 GiB,
> 100 GiB, 1 TiB and 10 TiB. There is no need for precise and accurate
> LRU size information to calculate this ratio. In addition, if
> deactivation is skipped for some LRU, the kernel will force deactivation
> under severe memory pressure.
>
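
To make the table concrete, here is a small userspace sketch of how the
target ratio can be derived from the LRU size. It is modelled on my
reading of inactive_is_low() in mm/vmscan.c (ratio = int_sqrt(10 * size
in whole GiB), or 1 below 1 GiB); the function names and the byte-based
interface are assumptions made for illustration, since the kernel works
in pages.

```c
#include <stdint.h>

#define GIB_SHIFT 30

/* Floor of the square root, standing in for the kernel's int_sqrt(). */
uint64_t isqrt_u64(uint64_t x)
{
	uint64_t r = 0;

	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

/* Target inactive:active ratio for an LRU pair of the given total size. */
uint64_t inactive_ratio(uint64_t lru_bytes)
{
	uint64_t gb = lru_bytes >> GIB_SHIFT; /* truncates to whole GiB */

	return gb ? isqrt_u64(10 * gb) : 1;
}

/* Nonzero when the inactive list is too small relative to the active one. */
int inactive_is_low(uint64_t inactive_bytes, uint64_t active_bytes)
{
	uint64_t ratio = inactive_ratio(inactive_bytes + active_bytes);

	return inactive_bytes * ratio < active_bytes;
}
```

Because the size is truncated to whole GiB and then passed through an
integer square root, a reading of 5 GiB and a stale reading of 5.5 GiB
yield the same ratio, which is why slightly stale LRU sizes rarely change
the decision.
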
> For the cache trim mode, the inactive file LRU size is read and the
> kernel scales it down based on the reclaim iteration
> (file >> sc->priority) and only checks if it is zero or not. Again,
> precise information is not needed.
>
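
The cache trim check can be sketched in the same spirit. This is a
simplified userspace model, assuming the decision is just "the scaled
inactive file size is nonzero and file deactivation is not requested";
the function name and its parameters are hypothetical stand-ins for the
scan_control fields.

```c
/*
 * Returns 1 when cache trim mode would be enabled, 0 otherwise.
 * inactive_file_pages: inactive file LRU size, in pages
 * priority:            reclaim priority (higher means less pressure)
 * may_deactivate_file: nonzero if file LRU deactivation is requested
 */
int cache_trim_mode(unsigned long inactive_file_pages, int priority,
		    int may_deactivate_file)
{
	/* Only the zero/nonzero outcome of the shift matters. */
	return (inactive_file_pages >> priority) != 0 && !may_deactivate_file;
}
```

At priority 12, a size of 100000 pages and a stale value of 110000 pages
both shift down to a nonzero result, so the outcome is the same. The
stale value only matters when the true size sits right at the 2^priority
boundary.
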
> This patch has been running on the Meta fleet for several months and we
> have not observed any issues. Please note that MGLRU is not impacted by
> this issue at all as it avoids rstat flushing completely.
>
> Link: https://lore.kernel.org/all/6ee2518b-81dd-4082-bdf5-322883895ffc@xxxxxxxxxx [1]
> Signed-off-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>

Just curious, does Jesper's patch help with this problem?

> ---
> Changes since v1:
> - Updated the commit message.
>
>  mm/vmscan.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 008b62abf104..82318464cd5e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2282,10 +2282,11 @@ static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
>         target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
>
>         /*
> -        * Flush the memory cgroup stats, so that we read accurate per-memcg
> -        * lruvec stats for heuristics.
> +        * Flush the memory cgroup stats in rate-limited way as we don't need
> +        * most accurate stats here. We may switch to regular stats flushing
> +        * in the future once it is cheap enough.
>          */
> -       mem_cgroup_flush_stats(sc->target_mem_cgroup);
> +       mem_cgroup_flush_stats_ratelimited(sc->target_mem_cgroup);
>
>         /*
>          * Determine the scan balance between anon and file LRUs.
> --
> 2.43.5
>




