From: Kairui Song <kasong@xxxxxxxxxxx>

This basically removes workingset_activation() and reduces the number
of calls to workingset_age_nonresident(). The idea behind this change
is a new way to calculate the refault distance, which seems to work
fine in most cases and also fits MGLRU (in later commits).

The current refault distance is based on two assumptions:

1. Activation of an inactive page will left-shift LRU pages (consider
   the LRU starting from the right).
2. Eviction of an inactive page will left-shift LRU pages.

Assumption 2 is correct, but assumption 1 is not always true: the
activated page could be anywhere in the LRU list, so it only
left-shifts the pages on its right, and one page can be activated and
deactivated multiple times. MGLRU doesn't fit this model either, since
there are multiple gens, and pages are aged and activated constantly
as gens grow.

So instead, introduce a new idea here: the "Shadow LRU Position".
Simply consider the evicted pages as still being in memory, each with
an eviction sequence number as before. Let `nonresident_age` be NA,
which is increased on each eviction, so the "Shadow LRU Position" of
one evicted page is:

  SP = ((NA's reading @ current) - (NA's reading @ eviction))

  +---------------------------------------+==========+========+
  | *            Shadow LRU       O  O  O | INACTIVE | ACTIVE |
  +----+----------------------------------+==========+========+
       |                                  |
       +----------------------------------+
       | SP
  refault page

  O -> Hole left by a previously refaulted page
  * -> The page corresponding to SP

SP simply stands for how far the current workload could have pushed a
page out of memory. This also means that if the page started on the
INACTIVE part, it *may* deserve re-activation if shifting it right by
SP slots would put it into the ACTIVE list while still not exceeding
total memory, which is:

  SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE

which can be simplified to:

  SP < NR_ACTIVE

Since this is only an estimation based on several hypotheses, and it
actually deviates from the normal LRU routine when the LRU is working
well, throttle it by two factors:

1. Previously refaulted pages may leave "holes" in the shadow LRU and
   decrease the re-activation rate for distant shadow pages.
2. When the ACTIVE part of the LRU is long enough, challenging it by
   activating a one-time-faulted inactive page may not be a good idea,
   so throttle by the ratio of ACTIVE/INACTIVE.

Combining all of the above, upon refault (a worked example with
made-up numbers follows below):

- If the ACTIVE LRU is low, check SP < NR_ACTIVE for re-activation.
- If the ACTIVE LRU is high, check
  SP < min(NR_ACTIVE, NR_INACTIVE) / (exponential ratio of ACTIVE / INACTIVE).

This is simpler than before since there is no longer any need to do
lruvec operations when activating a page, and so far a few benchmarks
show a fair result.
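To illustrate with made-up numbers: take a lruvec with no swap,
NR_ACTIVE = 4096 and NR_INACTIVE = 512 file pages. ACTIVE > INACTIVE,
so the throttled check applies:

  shift = 1 + (fls_long(4096) - fls_long(512)) / 2 = 1 + (13 - 10) / 2 = 2
  re-activate only if SP < NR_INACTIVE >> shift = 512 >> 2 = 128

With the sizes reversed (NR_ACTIVE = 512, NR_INACTIVE = 4096), the
plain check applies instead and a refaulting page is activated when
SP < NR_ACTIVE = 512.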
Using the memtier and fio tests from commit ac35a4902374, but scaled
down to fit in my test environment, plus some other test results:

memtier test (with 16G ramdisk as swap and 2G cgroup limit):

  memcached -u nobody -m 16384 -s /tmp/memcached.socket \
    -a 0700 -t 12 -B binary &
  memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys \
    --key-minimum=1 --key-maximum=24000000 --key-pattern=P:P -c 1 \
    -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6

fio test (with 16G ramdisk on /mnt and 4G cgroup limit):

  fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=random --norandommap \
    --time_based --ramp_time=5m --runtime=5m --group_reporting

pgbench setup using phoronix-test-suite with scale 1000 and 50 clients
on a 5G VM.

Linux kernel compilation test done with defconfig on a 2G VM.

Before:
  memcached: 48157.04 ops/s
  read: IOPS=2003k, BW=7823MiB/s (8203MB/s)(2292GiB/300001msec)
  pgbench: 5845 qps
  build-linux: 247.063

After:
  memcached: 49144.55 ops/s
  read: IOPS=2005k, BW=7832MiB/s (8212MB/s)(2294GiB/300002msec)
  pgbench: 5832 qps
  build-linux: 247.302

Signed-off-by: Kairui Song <kasong@xxxxxxxxxxx>
---
 include/linux/swap.h |   2 -
 mm/swap.c            |   1 -
 mm/vmscan.c          |   2 -
 mm/workingset.c      | 217 +++++++++++++++++++++----------------------
 4 files changed, 108 insertions(+), 114 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 456546443f1f..43e48023c4c4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -350,10 +350,8 @@ static inline void folio_set_swap_entry(struct folio *folio, swp_entry_t entry)
 
 /* linux/mm/workingset.c */
 bool workingset_test_recent(void *shadow, bool file, bool *workingset);
-void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
 void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg);
 void workingset_refault(struct folio *folio, void *shadow);
-void workingset_activation(struct folio *folio);
 
 /* Only track the nodes of mappings with shadow entries */
 void workingset_update_node(struct xa_node *node);
diff --git a/mm/swap.c b/mm/swap.c
index cd8f0150ba3a..685b446fd4f9 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -482,7 +482,6 @@ void folio_mark_accessed(struct folio *folio)
 		else
 			__lru_cache_activate_folio(folio);
 		folio_clear_referenced(folio);
-		workingset_activation(folio);
 	}
 	if (folio_test_idle(folio))
 		folio_clear_idle(folio);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1080209a568b..e7906f7fdc77 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2539,8 +2539,6 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
 		lruvec_add_folio(lruvec, folio);
 		nr_pages = folio_nr_pages(folio);
 		nr_moved += nr_pages;
-		if (folio_test_active(folio))
-			workingset_age_nonresident(lruvec, nr_pages);
 	}
 
 	/*
diff --git a/mm/workingset.c b/mm/workingset.c
index 4686ae363000..c0dea2c05f55 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -180,9 +180,10 @@
  */
 
 #define WORKINGSET_SHIFT 1
-#define EVICTION_SHIFT	((BITS_PER_LONG - BITS_PER_XA_VALUE) +	\
+#define EVICTION_SHIFT	((BITS_PER_LONG - BITS_PER_XA_VALUE) + \
 			 WORKINGSET_SHIFT + NODES_SHIFT + \
 			 MEM_CGROUP_ID_SHIFT)
+#define EVICTION_BITS	(BITS_PER_LONG - (EVICTION_SHIFT))
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
 
 /*
@@ -226,8 +227,105 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*workingsetp = workingset;
 }
 
-#ifdef CONFIG_LRU_GEN
+/*
+ * Get the distance reading at eviction time.
+ */
+static inline unsigned long lru_eviction(struct lruvec *lruvec,
+					 int bits, int bucket_order)
+{
+	unsigned long eviction = atomic_long_read(&lruvec->nonresident_age);
+
+	eviction >>= bucket_order;
+	eviction &= ~0UL >> (BITS_PER_LONG - bits);
+
+	return eviction;
+}
+
+/*
+ * Calculate and test refault distance
+ */
+static bool lru_refault(struct mem_cgroup *memcg,
+			struct lruvec *lruvec,
+			unsigned long eviction,
+			int bits, int bucket_order)
+{
+	unsigned long refault, distance;
+	unsigned long active, inactive;
+
+	eviction <<= bucket_order;
+	refault = atomic_long_read(&lruvec->nonresident_age);
+
+	/*
+	 * The unsigned subtraction here gives an accurate distance
+	 * across nonresident_age overflows in most cases. There is a
+	 * special case: usually, shadow entries have a short lifetime
+	 * and are either refaulted or reclaimed along with the inode
+	 * before they get too old. But it is not impossible for the
+	 * nonresident_age to lap a shadow entry in the field, which
+	 * can then result in a false small refault distance, leading
+	 * to a false activation should this old entry actually
+	 * refault again. However, earlier kernels used to deactivate
+	 * unconditionally with *every* reclaim invocation for the
+	 * longest time, so the occasional inappropriate activation
+	 * leading to pressure on the active list is not a problem.
+	 */
+	distance = (refault - eviction) & (~0UL >> (BITS_PER_LONG - bits));
+
+	active = lruvec_page_state(lruvec, NR_ACTIVE_FILE);
+	inactive = lruvec_page_state(lruvec, NR_INACTIVE_FILE);
+	if (mem_cgroup_get_nr_swap_pages(memcg) > 0) {
+		active += lruvec_page_state(lruvec, NR_ACTIVE_ANON);
+		inactive += lruvec_page_state(lruvec, NR_INACTIVE_ANON);
+	}
+
+	/*
+	 * When there are already enough active pages, be less aggressive
+	 * on activating pages, challenge already established workingset with
+	 * one time refaulted page may not be a good idea, especially as
+	 * the gap between active workingset and inactive queue grows larger.
+	 */
+	if (active > inactive)
+		return distance < inactive >> (1 + (fls_long(active) - fls_long(inactive)) / 2);
+
+	/*
+	 * Compare the distance to the existing workingset size. We
+	 * don't activate pages that couldn't stay resident even if
+	 * all the memory was available to the workingset. Whether
+	 * workingset competition needs to consider anon or not depends
+	 * on having free swap space.
+	 */
+	return distance < active;
+}
+
+/**
+ * workingset_age_nonresident - age non-resident entries as LRU ages
+ * @lruvec: the lruvec that was aged
+ * @nr_pages: the number of pages to count
+ *
+ * As in-memory pages are aged, non-resident pages need to be aged as
+ * well, in order for the refault distances later on to be comparable
+ * to the in-memory dimensions. This function allows reclaim and LRU
+ * operations to drive the non-resident aging along in parallel.
+ */
+static void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
+{
+	/*
+	 * Reclaiming a cgroup means reclaiming all its children in a
+	 * round-robin fashion. That means that each cgroup has an LRU
+	 * order that is composed of the LRU orders of its child
+	 * cgroups; and every page has an LRU position not just in the
+	 * cgroup that owns it, but in all of that group's ancestors.
+	 *
+	 * So when the physical inactive list of a leaf cgroup ages,
+	 * the virtual inactive lists of all its parents, including
+	 * the root cgroup's, age as well.
+	 */
+	do {
+		atomic_long_add(nr_pages, &lruvec->nonresident_age);
+	} while ((lruvec = parent_lruvec(lruvec)));
+}
+#ifdef CONFIG_LRU_GEN
 
 static void *lru_gen_eviction(struct folio *folio)
 {
 	int hist;
@@ -342,34 +440,6 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
 
 #endif /* CONFIG_LRU_GEN */
 
-/**
- * workingset_age_nonresident - age non-resident entries as LRU ages
- * @lruvec: the lruvec that was aged
- * @nr_pages: the number of pages to count
- *
- * As in-memory pages are aged, non-resident pages need to be aged as
- * well, in order for the refault distances later on to be comparable
- * to the in-memory dimensions. This function allows reclaim and LRU
- * operations to drive the non-resident aging along in parallel.
- */
-void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
-{
-	/*
-	 * Reclaiming a cgroup means reclaiming all its children in a
-	 * round-robin fashion. That means that each cgroup has an LRU
-	 * order that is composed of the LRU orders of its child
-	 * cgroups; and every page has an LRU position not just in the
-	 * cgroup that owns it, but in all of that group's ancestors.
-	 *
-	 * So when the physical inactive list of a leaf cgroup ages,
-	 * the virtual inactive lists of all its parents, including
-	 * the root cgroup's, age as well.
-	 */
-	do {
-		atomic_long_add(nr_pages, &lruvec->nonresident_age);
-	} while ((lruvec = parent_lruvec(lruvec)));
-}
-
 /**
  * workingset_eviction - note the eviction of a folio from memory
  * @target_memcg: the cgroup that is causing the reclaim
@@ -396,11 +466,11 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	/* XXX: target_memcg can be NULL, go through lruvec */
 	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
-	eviction = atomic_long_read(&lruvec->nonresident_age);
-	eviction >>= bucket_order;
+
+	eviction = lru_eviction(lruvec, EVICTION_BITS, bucket_order);
 	workingset_age_nonresident(lruvec, folio_nr_pages(folio));
 	return pack_shadow(memcgid, pgdat, eviction,
-				folio_test_workingset(folio));
+			   folio_test_workingset(folio));
 }
 
 /**
@@ -418,9 +488,6 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 {
 	struct mem_cgroup *eviction_memcg;
 	struct lruvec *eviction_lruvec;
-	unsigned long refault_distance;
-	unsigned long workingset_size;
-	unsigned long refault;
 	int memcgid;
 	struct pglist_data *pgdat;
 	unsigned long eviction;
@@ -429,7 +496,6 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 		return lru_gen_test_recent(shadow, file, &eviction_lruvec, &eviction, workingset);
 
 	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
-	eviction <<= bucket_order;
 
 	/*
	 * Look up the memcg associated with the stored ID. It might
@@ -450,50 +516,10 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 	eviction_memcg = mem_cgroup_from_id(memcgid);
 	if (!mem_cgroup_disabled() && !eviction_memcg)
 		return false;
-
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
-	refault = atomic_long_read(&eviction_lruvec->nonresident_age);
 
-	/*
-	 * Calculate the refault distance
-	 *
-	 * The unsigned subtraction here gives an accurate distance
-	 * across nonresident_age overflows in most cases. There is a
-	 * special case: usually, shadow entries have a short lifetime
-	 * and are either refaulted or reclaimed along with the inode
-	 * before they get too old. But it is not impossible for the
-	 * nonresident_age to lap a shadow entry in the field, which
-	 * can then result in a false small refault distance, leading
-	 * to a false activation should this old entry actually
-	 * refault again. However, earlier kernels used to deactivate
-	 * unconditionally with *every* reclaim invocation for the
-	 * longest time, so the occasional inappropriate activation
-	 * leading to pressure on the active list is not a problem.
-	 */
-	refault_distance = (refault - eviction) & EVICTION_MASK;
-
-	/*
-	 * Compare the distance to the existing workingset size. We
-	 * don't activate pages that couldn't stay resident even if
-	 * all the memory was available to the workingset. Whether
-	 * workingset competition needs to consider anon or not depends
-	 * on having free swap space.
-	 */
-	workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
-	if (!file) {
-		workingset_size += lruvec_page_state(eviction_lruvec,
-						     NR_INACTIVE_FILE);
-	}
-	if (mem_cgroup_get_nr_swap_pages(eviction_memcg) > 0) {
-		workingset_size += lruvec_page_state(eviction_lruvec,
-						     NR_ACTIVE_ANON);
-		if (file) {
-			workingset_size += lruvec_page_state(eviction_lruvec,
-							     NR_INACTIVE_ANON);
-		}
-	}
-
-	return refault_distance <= workingset_size;
+	return lru_refault(eviction_memcg, eviction_lruvec, eviction,
+			   EVICTION_BITS, bucket_order);
 }
 
 /**
@@ -543,7 +569,6 @@ void workingset_refault(struct folio *folio, void *shadow)
 		goto out;
 
 	folio_set_active(folio);
-	workingset_age_nonresident(lruvec, nr);
 	mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file, nr);
 
 	/* Folio was active prior to eviction */
@@ -560,30 +585,6 @@ void workingset_refault(struct folio *folio, void *shadow)
 	rcu_read_unlock();
 }
 
-/**
- * workingset_activation - note a page activation
- * @folio: Folio that is being activated.
- */
-void workingset_activation(struct folio *folio)
-{
-	struct mem_cgroup *memcg;
-
-	rcu_read_lock();
-	/*
-	 * Filter non-memcg pages here, e.g. unmap can call
-	 * mark_page_accessed() on VDSO pages.
-	 *
-	 * XXX: See workingset_refault() - this should return
-	 * root_mem_cgroup even for !CONFIG_MEMCG.
-	 */
-	memcg = folio_memcg_rcu(folio);
-	if (!mem_cgroup_disabled() && !memcg)
-		goto out;
-	workingset_age_nonresident(folio_lruvec(folio), folio_nr_pages(folio));
-out:
-	rcu_read_unlock();
-}
-
 /*
  * Shadow entries reflect the share of the working set that does not
  * fit into memory, so their number depends on the access pattern of
@@ -777,7 +778,6 @@ static struct lock_class_key shadow_nodes_key;
 
 static int __init workingset_init(void)
 {
-	unsigned int timestamp_bits;
 	unsigned int max_order;
 	int ret;
 
@@ -789,12 +789,11 @@ static int __init workingset_init(void)
 	 * some more pages at runtime, so keep working with up to
 	 * double the initial memory by using totalram_pages as-is.
 	 */
-	timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
 	max_order = fls_long(totalram_pages() - 1);
-	if (max_order > timestamp_bits)
-		bucket_order = max_order - timestamp_bits;
+	if (max_order > EVICTION_BITS)
+		bucket_order = max_order - EVICTION_BITS;
 	pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
-	       timestamp_bits, max_order, bucket_order);
+	       EVICTION_BITS, max_order, bucket_order);
 
 	ret = prealloc_shrinker(&workingset_shadow_shrinker, "mm-shadow");
 	if (ret)
-- 
2.41.0