On Thu, Apr 13, 2023 at 3:40 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> We keep track of different types of reclaimed pages through
> reclaim_state->reclaimed_slab, and we add them to the reported number
> of reclaimed pages. For non-memcg reclaim, this makes sense. For memcg
> reclaim, we have no clue if those pages are charged to the memcg under
> reclaim.
>
> Slab pages are shared by different memcgs, so a freed slab page may have
> only been partially charged to the memcg under reclaim. The same goes for
> clean file pages from pruned inodes (on highmem systems) or xfs buffer
> pages; there is currently no simple way to link them to the memcg under
> reclaim.
>
> Stop reporting those freed pages as reclaimed pages during memcg reclaim.
> This should make the return value of writing to memory.reclaim more
> accurate, and may help reduce unnecessary reclaim retries during memcg
> charging. Writing to memory.reclaim on the root memcg is considered as
> cgroup_reclaim(), but for this case we want to include any freed pages,
> so use the global_reclaim() check instead of !cgroup_reclaim().
>
> Generally, this should make the return value of
> try_to_free_mem_cgroup_pages() more accurate. In some limited cases
> (e.g. a freed slab page was mostly charged to the memcg under reclaim),
> the return value of try_to_free_mem_cgroup_pages() can be underestimated,
> but this should be fine. The freed pages will be uncharged anyway, and we
> can charge the memcg the next time around as we usually do memcg reclaim
> in a retry loop.
>
> Fixes: f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects instead of pages")

Andrew, I removed the CC: stable as you were sceptical about the need for a
backport, but left the Fixes tag so that it's easy to identify where to
backport it if you and/or the stable maintainers decide otherwise.

>
> Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> ---
>  mm/vmscan.c | 49 ++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 42 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9c1c5e8b24b8..be657832be48 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -511,6 +511,46 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>  }
>  #endif
>
> +/*
> + * flush_reclaim_state(): add pages reclaimed outside of LRU-based reclaim to
> + * scan_control->nr_reclaimed.
> + */
> +static void flush_reclaim_state(struct scan_control *sc)
> +{
> +	/*
> +	 * Currently, reclaim_state->reclaimed includes three types of pages
> +	 * freed outside of vmscan:
> +	 * (1) Slab pages.
> +	 * (2) Clean file pages from pruned inodes (on highmem systems).
> +	 * (3) XFS freed buffer pages.
> +	 *
> +	 * For all of these cases, we cannot universally link the pages to a
> +	 * single memcg. For example, a memcg-aware shrinker can free one object
> +	 * charged to the target memcg, causing an entire page to be freed.
> +	 * If we count the entire page as reclaimed from the memcg, we end up
> +	 * overestimating the reclaimed amount (potentially under-reclaiming).
> +	 *
> +	 * Only count such pages for global reclaim to prevent under-reclaiming
> +	 * from the target memcg; preventing unnecessary retries during memcg
> +	 * charging and false positives from proactive reclaim.
> +	 *
> +	 * For uncommon cases where the freed pages were actually mostly
> +	 * charged to the target memcg, we end up underestimating the reclaimed
> +	 * amount. This should be fine. The freed pages will be uncharged
> +	 * anyway, even if they are not counted here properly, and we will be
> +	 * able to make forward progress in charging (which is usually in a
> +	 * retry loop).
> +	 *
> +	 * We can go one step further, and report the uncharged objcg pages in
> +	 * memcg reclaim, to make reporting more accurate and reduce
> +	 * underestimation, but it's probably not worth the complexity for now.
> +	 */
> +	if (current->reclaim_state && global_reclaim(sc)) {
> +		sc->nr_reclaimed += current->reclaim_state->reclaimed;
> +		current->reclaim_state->reclaimed = 0;
> +	}
> +}
> +
>  static long xchg_nr_deferred(struct shrinker *shrinker,
>  			     struct shrink_control *sc)
>  {
> @@ -5346,8 +5386,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>  	vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned,
>  		   sc->nr_reclaimed - reclaimed);
>
> -	sc->nr_reclaimed += current->reclaim_state->reclaimed_slab;
> -	current->reclaim_state->reclaimed_slab = 0;
> +	flush_reclaim_state(sc);
>
>  	return success ? MEMCG_LRU_YOUNG : 0;
>  }
> @@ -6450,7 +6489,6 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>
>  static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  {
> -	struct reclaim_state *reclaim_state = current->reclaim_state;
>  	unsigned long nr_reclaimed, nr_scanned;
>  	struct lruvec *target_lruvec;
>  	bool reclaimable = false;
> @@ -6472,10 +6510,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>
>  	shrink_node_memcgs(pgdat, sc);
>
> -	if (reclaim_state) {
> -		sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> -		reclaim_state->reclaimed_slab = 0;
> -	}
> +	flush_reclaim_state(sc);
>
>  	/* Record the subtree's reclaim efficiency */
>  	if (!sc->proactive)
> --
> 2.40.0.577.gac1e443424-goog
>
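[Editor's note: for context, the distinction the patch relies on looks roughly
like this in mm/vmscan.c around this kernel version (a paraphrased sketch of
the existing helpers, not part of this diff; exact wording of comments may
differ): cgroup_reclaim() is true for any reclaim that has a target memcg,
while global_reclaim() also treats reclaim targeting the root memcg, such as
writing to the root's memory.reclaim, as global.]

static bool cgroup_reclaim(struct scan_control *sc)
{
	/* true for any reclaim with a target memcg, including the root memcg */
	return sc->target_mem_cgroup;
}

static bool global_reclaim(struct scan_control *sc)
{
	/* true for node reclaim and for memory.reclaim on the root memcg */
	return !sc->target_mem_cgroup ||
	       mem_cgroup_is_root(sc->target_mem_cgroup);
}

[This is why flush_reclaim_state() uses global_reclaim(sc) rather than
!cgroup_reclaim(sc): freed slab/buffer pages are still counted when reclaim
targets the root memcg, where any freed page falls under the target by
definition.]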