[RFC PATCH v2 5/5] workingset, lru_gen: apply refault-distance based re-activation

Kairui Song <ryncsn@xxxxxxxxx> · Wed, 13 Sep 2023 02:45:11 +0800

From: Kairui Song <kasong@xxxxxxxxxxx>

I noticed MGLRU not working very well on certain workloads, as
observed on some heavily stressed databases. This happens when the
file page workingset size exceeds total memory, and the access
distance (the left-shift time of a page before it gets activated,
considering LRU starts from right) of file pages is also larger than
total memory. All file pages are stuck on the oldest generation and
keep getting read in and then evicted, over and over. Despite anon
pages being idle, they never get aged. The PID controller doesn't kick
in until there are some minor access pattern changes, and file pages
are not promoted or reused.

Even though the memory can't cover the whole workingset, the
refault-distance based re-activation can help hold part of the
workingset in-memory to help reduce the IO workload significantly.

So apply it to MGLRU as well. The updated refault-distance model
fits MGLRU well in most cases, if we just consider the last two
generations as the inactive LRU and the first two generations as the
active LRU.
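
For reference, the final comparison that decides whether a refault
distance qualifies (visible as context lines in the lru_refault() hunk
of the diff) can be sketched in user space roughly as follows. This is
only an illustration: sketch_distance_qualifies() is a hypothetical
name, the distance and LRU sizes are passed in as plain numbers rather
than read from the lruvec, and the assert is a sketch-only guard that
the kernel code does not have.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical user-space mirror of the last comparison in lru_refault().
 * All sizes are page counts; "file" selects which inactive list is
 * added to the active size in the second branch.
 */
static bool sketch_distance_qualifies(unsigned long distance,
				      unsigned long active,
				      unsigned long inactive_anon,
				      unsigned long inactive_file,
				      bool file)
{
	/* Sketch-only guard against wraparound of the sum below. */
	assert(inactive_anon <= ULONG_MAX - inactive_file);

	if (active >= inactive_anon + inactive_file)
		return distance < inactive_anon + inactive_file;

	return distance < active + (file ? inactive_anon : inactive_file);
}
```

With the last two generations counted as inactive and the first two as
active, the same comparison is what lets part of an oversized file
workingset stay resident.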

Some adjustments are made to fit the logic better, and to make the
refault distance contribute to page tiering and PID refault detection
of MGLRU:

- If a tier-0 page has a qualified refault-distance, just promote
  it to a higher tier and send it to the second oldest gen.
- If a tier >= 1 page has a qualified refault-distance, mark it as
  active and send it to the youngest gen.
- Increase the reference count of every page that has a qualified
  refault-distance, and increase the PID controlled refault rate of
  the updated tier.
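
The tier and active-flag part of the rules above can be sketched in
user space as below. This is a simplified illustration, not the kernel
code: the sketch_* names are hypothetical, SKETCH_REFS_WIDTH stands in
for LRU_REFS_WIDTH and is assumed to be 2 here, and
sketch_tier_from_refs() assumes the tier is the smallest order that
covers refs + 1. Placement into a generation happens elsewhere and is
not modelled.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_REFS_WIDTH	2	/* assumed stand-in for LRU_REFS_WIDTH */
#define SKETCH_MAX_REFS		(1 << SKETCH_REFS_WIDTH)

/* Assumed tier mapping: smallest tier with (1 << tier) >= refs + 1. */
static int sketch_tier_from_refs(int refs)
{
	int tier = 0;

	assert(refs >= 0 && refs <= SKETCH_MAX_REFS);
	while ((1 << tier) < refs + 1)
		tier++;
	return tier;
}

struct sketch_decision {
	bool active;		/* would folio_set_active() be called? */
	int tier;		/* tier recorded at eviction time */
	int refault_tier;	/* tier charged with the refault */
};

/* "refault" means the refault distance qualified. */
static struct sketch_decision sketch_refault_decision(int refs, bool refault)
{
	struct sketch_decision d;

	d.tier = sketch_tier_from_refs(refs);
	d.refault_tier = d.tier;
	d.active = false;

	if (refault) {
		/* Pages that already had references become active. */
		if (refs)
			d.active = true;
		/* Count one more reference unless already saturated. */
		if (refs != SKETCH_MAX_REFS)
			d.refault_tier = sketch_tier_from_refs(refs + 1);
	}
	return d;
}
```

Under these assumptions, a page with no recorded references and a
qualified distance moves from tier 0 to tier 1 without being
activated, while a page with any recorded references is activated.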

The following benchmark shows the improvement. To simulate the
workload, I set up a 3-replica MongoDB cluster, each replica using
5 GB of cache and 10 GB of oplog, on a 32G VM. The benchmark is done
using https://github.com/apavlo/py-tpcc.git, modified to run only the
STOCK_LEVEL query, to simulate slow queries and get a stable result.

Before the patch (with 10G swap; the result is the same whether
swap is on or off):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 904 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL     503             27150226136.4   0.02 txn/s
------------------------------------------------------------------
  TOTAL           503             27150226136.4   0.02 txn/s

$ cat /proc/vmstat | grep working
workingset_nodes 53391
workingset_refault_anon 0
workingset_refault_file 23856735
workingset_activate_anon 0
workingset_activate_file 23845737
workingset_restore_anon 0
workingset_restore_file 18280692
workingset_nodereclaim 1024

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          31837        6752         379          23       24706       24607
Swap:         10239           0       10239

After the patch (with 10G swap on the same disk; similar result using ZRAM):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 903 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL     2575            27094953498.8   0.10 txn/s
------------------------------------------------------------------
  TOTAL           2575            27094953498.8   0.10 txn/s

$ cat /proc/vmstat | grep working
workingset_nodes 78249
workingset_refault_anon 10139
workingset_refault_file 23001863
workingset_activate_anon 7238
workingset_activate_file 6718032
workingset_restore_anon 7432
workingset_restore_file 6719406
workingset_nodereclaim 9747

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          31837        7376         320           3       24140       24014
Swap:         10239        1662        8577

Throughput is about 5x higher than before (2575 vs. 503 transactions
in the same duration), and the idle anon pages now get swapped out as
expected. Testing with lower stress also shows an improvement.

I also ran benchmarks with memtier/memcached, fio and mysql, using a
setup similar to the one in commit ac35a4902374 but scaled down to fit
in my test environment:

  memtier test (16G ramdisk as swap, 2G memcg limit, VM on an EPYC 7K62):
  memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 \
    -t 12 -B binary &
  memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys\
    --key-minimum=1 --key-maximum=24000000 --key-pattern=P:P -c 1 \
    -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6

  fio test (16G ramdisk on /mnt, 4G memcg limit, VM on an EPYC 7K62):
  fio -name=refault --numjobs=14 --directory=/mnt --size=1024m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=random --norandommap \
    --time_based --ramp_time=5m --runtime=5m --group_reporting

  mysql test (15G buffer pool with 16G memcg limit, VM on an EPYC 7K62):
  sysbench /usr/share/sysbench/oltp_read_only.lua <auth and db params> \
    --tables=48 --table-size=2000000 --threads=32 --time=1800

Before this patch:
memtier: 379329.77 op/s
fio: 5786.8k iops
mysql: 150190.43 qps

After this patch:
memtier: 373877.41 op/s
fio: 5805.5k iops
mysql: 150220.93 qps

The results look OK: apart from a bit of extra overhead from the added
atomic operations, there seems to be no LRU accuracy drop.

Signed-off-by: Kairui Song <kasong@xxxxxxxxxxx>
---
 mm/workingset.c | 78 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 53 insertions(+), 25 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index ff7587456b7f..1fa336054528 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -175,6 +175,7 @@
 			 MEM_CGROUP_ID_SHIFT)
 #define EVICTION_BITS	(BITS_PER_LONG - (EVICTION_SHIFT))
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
+#define LRU_GEN_EVICTION_BITS	(EVICTION_BITS - LRU_REFS_WIDTH - LRU_GEN_WIDTH)
 
 /*
  * Eviction timestamps need to be able to cover the full range of
@@ -185,6 +186,7 @@
  * evictions into coarser buckets by shaving off lower timestamp bits.
  */
 static unsigned int bucket_order __read_mostly;
+static unsigned int lru_gen_bucket_order __read_mostly;
 
 static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
 			 bool workingset)
@@ -240,7 +242,7 @@ static inline bool lru_refault(struct mem_cgroup *memcg,
 			       int bits, int bucket_order)
 {
 	unsigned long refault, distance;
-	unsigned long workingset, active, inactive, inactive_file, inactive_anon = 0;
+	unsigned long active, inactive_file, inactive_anon = 0;
 
 	eviction <<= bucket_order;
 	refault = atomic_long_read(&lruvec->nonresident_age);
@@ -280,7 +282,7 @@ static inline bool lru_refault(struct mem_cgroup *memcg,
 	 * active pages with one time refaulted page may not be a good idea.
 	 */
 	if (active >= (inactive_anon + inactive_file))
-		return distance < inactive_anon + inactive_file;
+		return distance < (inactive_anon + inactive_file);
 	else
 		return distance < active + (file ? inactive_anon : inactive_file);
 }
@@ -333,10 +335,14 @@ static void *lru_gen_eviction(struct folio *folio)
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
 	lrugen = &lruvec->lrugen;
 	min_seq = READ_ONCE(lrugen->min_seq[type]);
+
 	token = (min_seq << LRU_REFS_WIDTH) | max(refs - 1, 0);
+	token <<= LRU_GEN_EVICTION_BITS;
+	token |= lru_eviction(lruvec, LRU_GEN_EVICTION_BITS, lru_gen_bucket_order);
 
 	hist = lru_hist_from_seq(min_seq);
 	atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
+	workingset_age_nonresident(lruvec, folio_nr_pages(folio));
 
 	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, refs);
 }
@@ -351,44 +357,55 @@ static bool lru_gen_test_recent(struct lruvec *lruvec, bool file,
 	unsigned long min_seq;
 
 	min_seq = READ_ONCE(lruvec->lrugen.min_seq[file]);
+	token >>= LRU_GEN_EVICTION_BITS;
 	return (token >> LRU_REFS_WIDTH) == (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH));
 }
 
 static void lru_gen_refault(struct folio *folio, void *shadow)
 {
 	int memcgid;
-	bool recent;
+	bool refault;
 	bool workingset;
 	unsigned long token;
+	bool recent = false;
+	int refault_tier = 0;
 	int hist, tier, refs;
 	struct lruvec *lruvec;
+	struct mem_cgroup *memcg;
 	struct pglist_data *pgdat;
 	struct lru_gen_folio *lrugen;
 	int type = folio_is_file_lru(folio);
 	int delta = folio_nr_pages(folio);
 
-	rcu_read_lock();
-
 	unpack_shadow(shadow, &memcgid, &pgdat, &token, &workingset);
-	lruvec = mem_cgroup_lruvec(mem_cgroup_from_id(memcgid), pgdat);
-	if (lruvec != folio_lruvec(folio))
-		goto unlock;
+	memcg = mem_cgroup_from_id(memcgid);
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	/* memcg can be NULL, go through lruvec */
+	memcg = lruvec_memcg(lruvec);
 
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);
-
-	recent = lru_gen_test_recent(lruvec, type, token);
-	if (!recent)
-		goto unlock;
+	refault = lru_refault(memcg, lruvec, token, type,
+			      LRU_GEN_EVICTION_BITS, lru_gen_bucket_order);
+	if (lruvec == folio_lruvec(folio))
+		recent = lru_gen_test_recent(lruvec, type, token);
+	if (!recent && !refault)
+		return;
 
 	lrugen = &lruvec->lrugen;
-
 	hist = lru_hist_from_seq(READ_ONCE(lrugen->min_seq[type]));
 	/* see the comment in folio_lru_refs() */
+	token >>= LRU_GEN_EVICTION_BITS;
 	refs = (token & (BIT(LRU_REFS_WIDTH) - 1)) + workingset;
 	tier = lru_tier_from_refs(refs);
-
-	atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
-	mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
+	refault_tier = tier;
+
+	if (refault) {
+		if (refs)
+			folio_set_active(folio);
+		if (refs != BIT(LRU_REFS_WIDTH))
+			refault_tier = lru_tier_from_refs(refs + 1);
+		mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
+	}
 
 	/*
 	 * Count the following two cases as stalls:
@@ -397,12 +414,17 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
 	 * 2. For pages accessed multiple times through file descriptors,
 	 *    numbers of accesses might have been out of the range.
 	 */
-	if (lru_gen_in_fault() || refs == BIT(LRU_REFS_WIDTH)) {
+	if (refault || lru_gen_in_fault() || refs == BIT(LRU_REFS_WIDTH)) {
 		folio_set_workingset(folio);
 		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
 	}
-unlock:
-	rcu_read_unlock();
+
+	if (recent && refault_tier == tier) {
+		atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
+	} else {
+		atomic_long_add(delta, &lrugen->avg_total[type][refault_tier]);
+		atomic_long_add(delta, &lrugen->avg_refaulted[type][refault_tier]);
+	}
 }
 
 #else /* !CONFIG_LRU_GEN */
@@ -524,16 +546,15 @@ void workingset_refault(struct folio *folio, void *shadow)
 	bool workingset;
 	long nr;
 
-	if (lru_gen_enabled()) {
-		lru_gen_refault(folio, shadow);
-		return;
-	}
-
 	/* Flush stats (and potentially sleep) before holding RCU read lock */
 	mem_cgroup_flush_stats_ratelimited();
-
 	rcu_read_lock();
 
+	if (lru_gen_enabled()) {
+		lru_gen_refault(folio, shadow);
+		goto out;
+	}
+
 	/*
 	 * The activation decision for this folio is made at the level
 	 * where the eviction occurred, as that is where the LRU order
@@ -780,6 +801,13 @@ static int __init workingset_init(void)
 	pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
 	       EVICTION_BITS, max_order, bucket_order);
 
+#ifdef CONFIG_LRU_GEN
+	if (max_order > LRU_GEN_EVICTION_BITS)
+		lru_gen_bucket_order = max_order - LRU_GEN_EVICTION_BITS;
+	pr_info("workingset: lru_gen_timestamp_bits=%d lru_gen_bucket_order=%u\n",
+		LRU_GEN_EVICTION_BITS, lru_gen_bucket_order);
+#endif
+
 	ret = prealloc_shrinker(&workingset_shadow_shrinker, "mm-shadow");
 	if (ret)
 		goto err;
-- 
2.41.0