On Tue, Aug 13, 2024 at 2:54 PM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
>
> Meta production is seeing a large number of stalls in memcg stats flush
> from the memcg reclaim code path. At the moment, this specific callsite
> is doing a synchronous memcg stats flush. The rstat flush is an
> expensive and time-consuming operation, so concurrent reclaimers will
> busy-wait on the lock, potentially for a long time. This issue is not
> unique to Meta and has been observed by Cloudflare [1] as well. In the
> Cloudflare case, the stalls were due to contention between kswapd
> threads running on their 8-NUMA-node machines, which does not make
> sense, as the rstat flush is global and a flush from one kswapd thread
> should be sufficient for all. Simply replace the synchronous flush with
> the ratelimited one.
>
> One may raise a concern about potentially using stats that are (at
> worst) 2 seconds stale for heuristics like the desirable
> inactive:active ratio and preferring inactive file pages over anon
> pages, but these specific heuristics do not require very precise stats
> and are also ignored under severe memory pressure.
>
> More specifically for this code path, the stats are needed for two
> heuristics:
>
> 1. Deactivate LRUs
> 2. Cache trim mode
>
> The deactivate-LRUs heuristic maintains a desirable inactive:active
> ratio of the LRUs. The specific stats needed are WORKINGSET_ACTIVATE*
> and the hierarchical LRU sizes. WORKINGSET_ACTIVATE* is needed to check
> whether there has been a refault since the last snapshot, and the LRU
> sizes are needed for the desirable ratio between the inactive and
> active LRUs. See the table below on how the desirable ratio is
> calculated.
>
> /* total     target    max
>  * memory    ratio     inactive
>  * -------------------------------------
>  *   10MB       1         5MB
>  *  100MB       1        50MB
>  *    1GB       3       250MB
>  *   10GB      10       0.9GB
>  *  100GB      31         3GB
>  *    1TB     101        10GB
>  *   10TB     320        32GB
>  */
>
> The desirable ratio only changes at the boundaries of 1 GiB, 10 GiB,
> 100 GiB, 1 TiB and 10 TiB. There is no need for precise and accurate
> LRU size information to calculate this ratio. In addition, if
> deactivation is skipped for some LRU, the kernel will force
> deactivation under severe memory pressure.
>
> For the cache trim mode, the inactive file LRU size is read and the
> kernel scales it down based on the reclaim iteration
> (file >> sc->priority) and only checks whether it is zero. Again,
> precise information is not needed.
>
> This patch has been running on the Meta fleet for several months and we
> have not observed any issues. Please note that MGLRU is not impacted by
> this issue at all as it avoids rstat flushing completely.
>
> Link: https://lore.kernel.org/all/6ee2518b-81dd-4082-bdf5-322883895ffc@xxxxxxxxxx [1]
> Signed-off-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>

Just curious, does Jesper's patch help with this problem?

> ---
> Changes since v1:
> - Updated the commit message.
>
>  mm/vmscan.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 008b62abf104..82318464cd5e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2282,10 +2282,11 @@ static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
>  	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
>  
>  	/*
> -	 * Flush the memory cgroup stats, so that we read accurate per-memcg
> -	 * lruvec stats for heuristics.
> +	 * Flush the memory cgroup stats in rate-limited way as we don't need
> +	 * most accurate stats here. We may switch to regular stats flushing
> +	 * in the future once it is cheap enough.
>  	 */
> -	mem_cgroup_flush_stats(sc->target_mem_cgroup);
> +	mem_cgroup_flush_stats_ratelimited(sc->target_mem_cgroup);
>  
>  	/*
>  	 * Determine the scan balance between anon and file LRUs.
> --
> 2.43.5
>
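
As a footnote for anyone skimming the two heuristics above: here is a
small userspace sketch of how coarse both checks are. It is my own
illustration, not part of the patch; target_ratio() and
cache_trim_possible() are made-up names. If I am reading
inactive_is_low() in mm/vmscan.c right, the target ratio in the table
is int_sqrt(10 * gb), where gb is the combined inactive+active LRU size
in whole GiB, and the cache trim check only tests whether the inactive
file size shifted down by sc->priority is nonzero:

#include <inttypes.h>
#include <math.h>
#include <stdbool.h>
#include <stdio.h>

/* Approximates int_sqrt(10 * gb), the formula that appears to generate
 * the table in the commit message above.
 */
static uint64_t target_ratio(uint64_t total_bytes)
{
	uint64_t gb = total_bytes >> 30;	/* whole GiB, truncated */

	return gb ? (uint64_t)sqrt(10.0 * (double)gb) : 1;
}

/* Cache trim mode only asks: is (file >> sc->priority) nonzero? */
static bool cache_trim_possible(uint64_t inactive_file_pages, int priority)
{
	return (inactive_file_pages >> priority) > 0;
}

int main(void)
{
	const uint64_t sizes[] = {
		10ULL << 20, 100ULL << 20, 1ULL << 30, 10ULL << 30,
		100ULL << 30, 1ULL << 40, 10ULL << 40,
	};

	for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		uint64_t ratio = target_ratio(sizes[i]);

		/* max inactive = total / (1 + ratio), as in the table */
		printf("total %9.3f GiB -> ratio %3" PRIu64
		       ", max inactive %7.3f GiB\n",
		       sizes[i] / (double)(1ULL << 30), ratio,
		       sizes[i] / (double)(1 + ratio) / (double)(1ULL << 30));
	}

	/* DEF_PRIORITY is 12, so the first pass drops the low 12 bits. */
	printf("trim possible at priority 12 with 40000 pages: %d\n",
	       cache_trim_possible(40000, 12));
	return 0;
}

The ratio only moves when the LRU total crosses a GiB-scale boundary,
and the trim check throws away the low sc->priority bits, so stats that
are up to 2 seconds stale should rarely flip either decision.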