Hello Yu,

On Tue, Feb 08, 2022 at 01:18:54AM -0700, Yu Zhao wrote:
> @@ -92,11 +92,196 @@ static __always_inline enum lru_list folio_lru_list(struct folio *folio)
>  	return lru;
>  }
>  
> +#ifdef CONFIG_LRU_GEN
> +
> +static inline bool lru_gen_enabled(void)
> +{
> +	return true;
> +}
> +
> +static inline bool lru_gen_in_fault(void)
> +{
> +	return current->in_lru_fault;
> +}
> +
> +static inline int lru_gen_from_seq(unsigned long seq)
> +{
> +	return seq % MAX_NR_GENS;
> +}
> +
> +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
> +{
> +	unsigned long max_seq = lruvec->lrugen.max_seq;
> +
> +	VM_BUG_ON(gen >= MAX_NR_GENS);
> +
> +	/* see the comment on MIN_NR_GENS */
> +	return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
> +}

I'm still reading the series, so correct me if I'm wrong: the "active"
set is split into two generations for the sole purpose of the
second-chance policy for fresh faults, right?

If so, it'd be better to have the comment here instead of down by
MIN_NR_GENS. This is the place that defines what "active" is, so this
is where the reader asks what it means and what it implies. The
definition of MIN_NR_GENS can be briefer: "need at least two for
second chance, see lru_gen_is_active() for details".

> +static inline void lru_gen_update_size(struct lruvec *lruvec, enum lru_list lru,
> +				       int zone, long delta)
> +{
> +	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +
> +	lockdep_assert_held(&lruvec->lru_lock);
> +	WARN_ON_ONCE(delta != (int)delta);
> +
> +	__mod_lruvec_state(lruvec, NR_LRU_BASE + lru, delta);
> +	__mod_zone_page_state(&pgdat->node_zones[zone], NR_ZONE_LRU_BASE + lru, delta);
> +}

This is a duplicate of update_lru_size(), please use that instead.

Yeah technically you don't need the mem_cgroup_update_lru_size() but
that's not worth sweating over, better to keep it simple.
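To illustrate, the branches of lru_gen_balance_size() below could call
it directly. Untested sketch, assuming update_lru_size()'s current
(lruvec, lru, zone, nr_pages) signature, e.g. for the addition case:

	/* Addition */
	if (old_gen < 0) {
		if (lru_gen_is_active(lruvec, new_gen))
			lru += LRU_ACTIVE;
		update_lru_size(lruvec, lru, zone, delta);
		return;
	}

and likewise for the other branches, with the local helper deleted.
The mem_cgroup_update_lru_size() that update_lru_size() pulls in is
the "not worth sweating over" part.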
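And to make the lru_gen_is_active() suggestion above concrete, this is
roughly what I picture, assuming my reading of the second-chance
policy is right (wording is obviously yours to pick):

	static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
	{
		unsigned long max_seq = lruvec->lrugen.max_seq;

		VM_BUG_ON(gen >= MAX_NR_GENS);

		/*
		 * The two youngest generations count as "active". The
		 * active set is split in two solely to implement the
		 * second-chance policy for freshly faulted pages.
		 */
		return gen == lru_gen_from_seq(max_seq) ||
		       gen == lru_gen_from_seq(max_seq - 1);
	}

with MIN_NR_GENS then carrying just the one-liner:

	/* need at least two for second chance, see lru_gen_is_active() */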
> +static inline void lru_gen_balance_size(struct lruvec *lruvec, struct folio *folio,
> +					int old_gen, int new_gen)

lru_gen_update_lru_sizes() for this one would be more descriptive imo
and in line with update_lru_size() that it's built on.

> +{
> +	int type = folio_is_file_lru(folio);
> +	int zone = folio_zonenum(folio);
> +	int delta = folio_nr_pages(folio);
> +	enum lru_list lru = type * LRU_INACTIVE_FILE;
> +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +	VM_BUG_ON(old_gen != -1 && old_gen >= MAX_NR_GENS);
> +	VM_BUG_ON(new_gen != -1 && new_gen >= MAX_NR_GENS);
> +	VM_BUG_ON(old_gen == -1 && new_gen == -1);

Could be a bit easier to read quickly with high-level descriptions:

> +	if (old_gen >= 0)
> +		WRITE_ONCE(lrugen->nr_pages[old_gen][type][zone],
> +			   lrugen->nr_pages[old_gen][type][zone] - delta);
> +	if (new_gen >= 0)
> +		WRITE_ONCE(lrugen->nr_pages[new_gen][type][zone],
> +			   lrugen->nr_pages[new_gen][type][zone] + delta);
> +	/* Addition */
> +	if (old_gen < 0) {
> +		if (lru_gen_is_active(lruvec, new_gen))
> +			lru += LRU_ACTIVE;
> +		lru_gen_update_size(lruvec, lru, zone, delta);
> +		return;
> +	}
> +	/* Removal */
> +	if (new_gen < 0) {
> +		if (lru_gen_is_active(lruvec, old_gen))
> +			lru += LRU_ACTIVE;
> +		lru_gen_update_size(lruvec, lru, zone, -delta);
> +		return;
> +	}
> +	/* Promotion */
> +	if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) {
> +		lru_gen_update_size(lruvec, lru, zone, -delta);
> +		lru_gen_update_size(lruvec, lru + LRU_ACTIVE, zone, delta);
> +	}
> +
> +	/* Promotion is legit while a page is on an LRU list, but demotion isn't. */
> +	/* Demotion happens during aging when pages are isolated, never on-LRU */
> +	VM_BUG_ON(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen));
> +}

On that note, please move introduction of the promotion and demotion
bits to the next patch. They aren't used here yet, and I spent some
time jumping around patches to verify the promotion callers and
confirm the validity of the BUG_ON.

> +static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> +{
> +	int gen;
> +	unsigned long old_flags, new_flags;
> +	int type = folio_is_file_lru(folio);
> +	int zone = folio_zonenum(folio);
> +	struct lru_gen_struct *lrugen = &lruvec->lrugen;
> +
> +	if (folio_test_unevictable(folio) || !lrugen->enabled)
> +		return false;

These two checks should be in the callsite and the function should
return void. Otherwise you can't understand the callsite without
drilling down into lrugen code, even if lrugen is disabled.
folio_add_lru() gets it right. (Rough sketch of what I mean at the
bottom of this mail.)

> +	/*
> +	 * There are three common cases for this page:
> +	 * 1) If it shouldn't be evicted, e.g., it was just faulted in, add it
> +	 *    to the youngest generation.

"shouldn't be evicted" makes it sound like mlock. But they should just
be evicted last, right? Maybe:

	/*
	 * Pages start in different generations depending on
	 * advance knowledge we have about their hotness and
	 * evictability:
	 *
	 * 1. Already active pages start out youngest. This can be
	 *    fresh faults, or refaults of previously hot pages.
	 * 2. Cold pages that require writeback before becoming
	 *    evictable start on the second oldest generation.
	 * 3. Everything else (clean, cold) starts old.
	 */

On that note, I think #1 is reintroducing a problem we have fixed
before, which is trashing the workingset with a flood of use-once
mmapped pages. It's the classic scenario where LFU beats LRU.

Mapped streaming IO isn't very common, but it does happen. See these
commits:

dfc8d636cdb95f7b792d5ba8c9f3b295809c125d
31c0569c3b0b6cc8a867ac6665ca081553f7984c
645747462435d84c6c6a64269ed49cc3015f753d
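Here is the kind of callsite I have in mind for the lru_gen_add_folio()
point above. Only a sketch: I'm assuming lruvec_add_folio() is the
relevant callsite and that lruvec->lrugen.enabled is the right way to
spell the "lrugen enabled" test at this point in the series; adjust as
needed:

	static __always_inline
	void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
	{
		enum lru_list lru = folio_lru_list(folio);

		/* Evictable page, generations enabled: lrugen takes it */
		if (lruvec->lrugen.enabled && !folio_test_unevictable(folio)) {
			lru_gen_add_folio(lruvec, folio, false);
			return;
		}

		/* Otherwise fall back to the classic LRU lists */
		update_lru_size(lruvec, lru, folio_zonenum(folio),
				folio_nr_pages(folio));
		list_add(&folio->lru, &lruvec->lists[lru]);
	}

That way lru_gen_add_folio() can return void, and a reader of
lruvec_add_folio() sees both policies and the conditions for each
without descending into the lrugen code.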