On Fri, Dec 22, 2023 at 3:24 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> From: Kairui Song <kasong@xxxxxxxxxxx>
>
> When lru_gen is aging, it will update mm counters page by page,
> which causes a higher overhead if age happens frequently or there
> are a lot of pages in one generation getting moved.
> Optimize this by doing the counter update in batch.
>
> Although most __mod_*_state has its own caches the overhead
> is still observable.
>
> Tested in a 4G memcg on a EPYC 7K62 with:
>
>   memcached -u nobody -m 16384 -s /tmp/memcached.socket \
>     -a 0766 -t 16 -B binary &
>
>   memtier_benchmark -S /tmp/memcached.socket \
>     -P memcache_binary -n allkeys \
>     --key-minimum=1 --key-maximum=16000000 -d 1024 \
>     --ratio=1:0 --key-pattern=P:P -c 2 -t 16 --pipeline 8 -x 6
>
> Average result of 18 test runs:
>
> Before: 44017.78 Ops/sec
> After:  44687.08 Ops/sec (+1.5%)
>
> Signed-off-by: Kairui Song <kasong@xxxxxxxxxxx>
> ---
>  mm/vmscan.c | 64 +++++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 55 insertions(+), 9 deletions(-)

Usually most reclaim activity happens in kswapd, e.g., from the MongoDB
benchmark (--duration=900):

  pgscan_kswapd 11294317
  pgscan_direct 128

And kswapd always has current->reclaim_state->mm_walk. So the following
should bring the vast majority of the improvement (assuming it's not
noise) with far less code change:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9dd8977de5a2..c06e00635d2b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3095,6 +3095,8 @@ static int folio_update_gen(struct folio *folio, int gen)
 static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
 {
 	int type = folio_is_file_lru(folio);
+	struct lru_gen_mm_walk *walk = current->reclaim_state->mm_walk;
+
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
 	unsigned long new_flags, old_flags = READ_ONCE(folio->flags);
@@ -3116,7 +3118,10 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
 		new_flags |= BIT(PG_reclaim);
 	} while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
 
-	lru_gen_update_size(lruvec, folio, old_gen, new_gen);
+	if (walk)
+		update_batch_size(walk, folio, old_gen, new_gen);
+	else
+		lru_gen_update_size(lruvec, folio, old_gen, new_gen);
 
 	return new_gen;
 }
@@ -3739,6 +3744,8 @@ static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan)
 	int prev, next;
 	int type, zone;
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
+	struct lru_gen_mm_walk *walk = current->reclaim_state->mm_walk;
+
 restart:
 	spin_lock_irq(&lruvec->lru_lock);
 
@@ -3758,6 +3765,9 @@ static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan)
 		goto restart;
 	}
 
+	if (walk && walk->batched)
+		reset_batch_size(lruvec, walk);
+
 	/*
 	 * Update the active/inactive LRU sizes for compatibility. Both sides of
 	 * the current max_seq need to be covered, since max_seq+1 can overlap
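
For anyone reading along who doesn't have the batching code paged in:
update_batch_size() and reset_batch_size() already exist in mm/vmscan.c
for the page table walk path, and roughly (simplified from memory, not
verbatim) they do the following:

	/*
	 * Rough sketch of the existing helpers in mm/vmscan.c (simplified,
	 * not verbatim), to show why reusing them is enough for batching.
	 */
	static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio,
				      int old_gen, int new_gen)
	{
		int type = folio_is_file_lru(folio);
		int zone = folio_zonenum(folio);
		int delta = folio_nr_pages(folio);

		walk->batched++;

		/* accumulate the generation move in the per-walk array */
		walk->nr_pages[old_gen][type][zone] -= delta;
		walk->nr_pages[new_gen][type][zone] += delta;
	}

	static void reset_batch_size(struct lruvec *lruvec, struct lru_gen_mm_walk *walk)
	{
		int gen, type, zone;
		struct lru_gen_folio *lrugen = &lruvec->lrugen;

		walk->batched = 0;

		/* flush the accumulated deltas once, with lru_lock held */
		for_each_gen_type_zone(gen, type, zone) {
			enum lru_list lru = type * LRU_INACTIVE_FILE;
			int delta = walk->nr_pages[gen][type][zone];

			if (!delta)
				continue;

			walk->nr_pages[gen][type][zone] = 0;
			WRITE_ONCE(lrugen->nr_pages[gen][type][zone],
				   lrugen->nr_pages[gen][type][zone] + delta);

			if (lru_gen_is_active(lruvec, gen))
				lru += LRU_ACTIVE;
			__update_lru_size(lruvec, lru, zone, delta);
		}
	}

So folio_inc_gen() only touches the per-walk array in the common
(kswapd) case, and the hunk above calls reset_batch_size() from
inc_max_seq() after it has taken lru_lock, which should be the right
place to flush the deltas.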