On Thu, Apr 6, 2023 at 1:45 PM Peter Xu <peterx@xxxxxxxxxx> wrote: > > Hi, Yosry, > > On Wed, Apr 05, 2023 at 06:54:27PM +0000, Yosry Ahmed wrote: > > [...] > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index c82bd89f90364..049e39202e6ce 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -188,18 +188,6 @@ struct scan_control { > > */ > > int vm_swappiness = 60; > > > > -static void set_task_reclaim_state(struct task_struct *task, > > - struct reclaim_state *rs) > > -{ > > - /* Check for an overwrite */ > > - WARN_ON_ONCE(rs && task->reclaim_state); > > - > > - /* Check for the nulling of an already-nulled member */ > > - WARN_ON_ONCE(!rs && !task->reclaim_state); > > - > > - task->reclaim_state = rs; > > -} > > - > > LIST_HEAD(shrinker_list); > > DECLARE_RWSEM(shrinker_rwsem); > > > > @@ -511,6 +499,59 @@ static bool writeback_throttling_sane(struct scan_control *sc) > > } > > #endif > > > > +static void set_task_reclaim_state(struct task_struct *task, > > + struct reclaim_state *rs) > > +{ > > + /* Check for an overwrite */ > > + WARN_ON_ONCE(rs && task->reclaim_state); > > + > > + /* Check for the nulling of an already-nulled member */ > > + WARN_ON_ONCE(!rs && !task->reclaim_state); > > + > > + task->reclaim_state = rs; > > +} > > Nit: I just think such movement not necessary while it loses the "git > blame" information easily. > > Instead of moving this here without major benefit, why not just define > flush_reclaim_state() right after previous set_task_reclaim_state()? An earlier version did that, but we would have to add a forward declaration of global_reclaim() (or cgroup_reclaim()), as they are defined after the previous position of set_task_reclaim_state(). > > > + > > +/* > > + * flush_reclaim_state(): add pages reclaimed outside of LRU-based reclaim to > > + * scan_control->nr_reclaimed. > > + */ > > +static void flush_reclaim_state(struct scan_control *sc, > > + struct reclaim_state *rs) > > +{ > > + /* > > + * Currently, reclaim_state->reclaimed includes three types of pages > > + * freed outside of vmscan: > > + * (1) Slab pages. > > + * (2) Clean file pages from pruned inodes. > > + * (3) XFS freed buffer pages. > > + * > > + * For all of these cases, we have no way of finding out whether these > > + * pages were related to the memcg under reclaim. For example, a freed > > + * slab page could have had only a single object charged to the memcg > > + * under reclaim. Also, populated inodes are not on shrinker LRUs > > + * anymore except on highmem systems. > > + * > > + * Instead of over-reporting the reclaimed pages in a memcg reclaim, > > + * only count such pages in global reclaim. This prevents unnecessary > > + * retries during memcg charging and false positive from proactive > > + * reclaim (memory.reclaim). > > + * > > + * For uncommon cases were the freed pages were actually significantly > > + * charged to the memcg under reclaim, and we end up under-reporting, it > > + * should be fine. The freed pages will be uncharged anyway, even if > > + * they are not reported properly, and we will be able to make forward > > + * progress in charging (which is usually in a retry loop). > > + * > > + * We can go one step further, and report the uncharged objcg pages in > > + * memcg reclaim, to make reporting more accurate and reduce > > + * under-reporting, but it's probably not worth the complexity for now. > > + */ > > + if (rs && global_reclaim(sc)) { > > + sc->nr_reclaimed += rs->reclaimed; > > + rs->reclaimed = 0; > > + } > > +} > > + > > static long xchg_nr_deferred(struct shrinker *shrinker, > > struct shrink_control *sc) > > { > > @@ -5346,10 +5387,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc) > > vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned, > > sc->nr_reclaimed - reclaimed); > > > > - if (global_reclaim(sc)) { > > - sc->nr_reclaimed += current->reclaim_state->reclaimed_slab; > > - current->reclaim_state->reclaimed_slab = 0; > > - } > > + flush_reclaim_state(sc, current->reclaim_state); > > > > return success ? MEMCG_LRU_YOUNG : 0; > > } > > @@ -6474,10 +6512,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) > > > > shrink_node_memcgs(pgdat, sc); > > > > - if (reclaim_state && global_reclaim(sc)) { > > - sc->nr_reclaimed += reclaim_state->reclaimed_slab; > > - reclaim_state->reclaimed_slab = 0; > > - } > > + flush_reclaim_state(sc, reclaim_state); > > IIUC reclaim_state here still points to current->reclaim_state. Could it > change at all? > > Is it cleaner to make flush_reclaim_state() taking "sc" only if it always > references current->reclaim_state? Good point. I think it's always current->reclaim_state. I think we can make flush_reclaim_state() only take "sc" as an argument, and remove the "reclaim_state" local variable in shrink_node() completely. > > > > > /* Record the subtree's reclaim efficiency */ > > if (!sc->proactive) > > -- > > 2.40.0.348.gf938b09366-goog > > > > -- > Peter Xu >