On Tue, Aug 13, 2024 at 02:58:51PM GMT, Yosry Ahmed wrote:
> On Tue, Aug 13, 2024 at 2:54 PM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
> >
> > The Meta prod is seeing a large number of stalls in memcg stats flush
> > from the memcg reclaim code path. At the moment, this specific callsite
> > is doing a synchronous memcg stats flush. The rstat flush is an
> > expensive and time-consuming operation, so concurrent reclaimers will
> > busy-wait on the lock potentially for a long time. Actually this issue is
> > not unique to Meta and has been observed by Cloudflare [1] as well. For
> > the Cloudflare case, the stalls were due to contention between kswapd
> > threads running on their 8 NUMA node machines, which does not make sense
> > as rstat flush is global and flush from one kswapd thread should be
> > sufficient for all. Simply replace the synchronous flush with the
> > ratelimited one.
> >
> > One may raise a concern on potentially using 2 sec stale (at worst)
> > stats for heuristics like desirable inactive:active ratio and preferring
> > inactive file pages over anon pages but these specific heuristics do not
> > require very precise stats and also are ignored under severe memory
> > pressure.
> >
> > More specifically for this code path, the stats are needed for two
> > specific heuristics:
> >
> > 1. Deactivate LRUs
> > 2. Cache trim mode
> >
> > The deactivate LRUs heuristic is to maintain a desirable inactive:active
> > ratio of the LRUs. The specific stats needed are WORKINGSET_ACTIVATE*
> > and the hierarchical LRU size. The WORKINGSET_ACTIVATE* is needed to
> > check if there is a refault since last snapshot and the LRU size is
> > needed for the desirable ratio between inactive and active LRUs. See the
> > table below on how the desirable ratio is calculated.
> >
> > /* total     target    max
> >  * memory    ratio     inactive
> >  * -------------------------------------
> >  *   10MB       1         5MB
> >  *  100MB       1        50MB
> >  *    1GB       3       250MB
> >  *   10GB      10       0.9GB
> >  *  100GB      31         3GB
> >  *    1TB     101        10GB
> >  *   10TB     320        32GB
> >  */
> >
> > The desirable ratio only changes at the boundary of 1 GiB, 10 GiB,
> > 100 GiB, 1 TiB and 10 TiB. There is no need for the precise and accurate
> > LRU size information to calculate this ratio. In addition, if
> > deactivation is skipped for some LRU, the kernel will force deactivation
> > under severe memory pressure.
> >
> > For the cache trim mode, inactive file LRU size is read and the kernel
> > scales it down based on the reclaim iteration (file >> sc->priority) and
> > only checks if it is zero or not. Again precise information is not
> > needed.
> >
> > This patch has been running on the Meta fleet for several months and we
> > have not observed any issues. Please note that MGLRU is not impacted by
> > this issue at all as it avoids rstat flushing completely.
> >
> > Link: https://lore.kernel.org/all/6ee2518b-81dd-4082-bdf5-322883895ffc@xxxxxxxxxx [1]
> > Signed-off-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>
>
> Just curious, does Jesper's patch help with this problem?

If you are asking if I have tested Jesper's patch in Meta's production
then no, I have not tested it. Also I have not taken a look at the
latest from Jesper as I was stuck on some other issues.
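
For readers following the argument in the quoted description, here is a
rough userspace sketch of why slightly stale LRU sizes rarely change
either heuristic. It is not the kernel code: the sketch_* names, the
4 KiB page size and the integer square root helper are assumptions made
for illustration, and the ratio formula (square root of 10 times the LRU
size in GiB, or 1 below 1 GiB) is my reading of what produces the table
in the quoted text.

```c
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages, i.e. a page shift of 12. */
#define SKETCH_PAGE_SHIFT 12

/* Floor of the integer square root, a stand-in for a kernel helper. */
static uint64_t sketch_isqrt(uint64_t x)
{
	uint64_t r = 0;
	uint64_t bit = (uint64_t)1 << 62;

	/* Start from the largest power of four not greater than x. */
	while (bit > x)
		bit >>= 2;

	while (bit != 0) {
		if (x >= r + bit) {
			x -= r + bit;
			r = (r >> 1) + bit;
		} else {
			r >>= 1;
		}
		bit >>= 2;
	}
	return r;
}

/*
 * Target inactive:active ratio for an LRU of lru_pages pages
 * (inactive + active). Below 1 GiB the ratio is 1; above that it
 * grows with the square root of the size, which is why it only
 * moves at coarse size boundaries.
 */
static uint64_t sketch_inactive_ratio(uint64_t lru_pages)
{
	uint64_t gb = lru_pages >> (30 - SKETCH_PAGE_SHIFT);

	return gb ? sketch_isqrt(10 * gb) : 1;
}

/*
 * Cache trim check as described in the quoted text: scale the
 * inactive file LRU size down by the reclaim priority and only
 * test whether anything is left.
 */
static int sketch_cache_trim_nonzero(uint64_t inactive_file_pages,
				     unsigned int priority)
{
	return (inactive_file_pages >> priority) != 0;
}
```

With these assumptions, an LRU of 10 GiB and one of 10.9 GiB both yield a
ratio of 10, and at priority 12 any inactive file LRU of at least 4096
pages passes the non-zero check, so a reading that is a couple of seconds
old would have to straddle one of those boundaries to change the outcome.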