On 18.10.24 08:48, Kanchana P Sridhar wrote:
This patch invokes the swapin_readahead() based batching interface to
prefetch a batch of 4K folios for zswap load, with batched decompressions
done in parallel using IAA hardware. swapin_readahead() prefetches folios
based on vm.page-cluster and on how useful prior prefetches were to the
workload. As folios are created in the swapcache, the readahead code
calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch", and the
respective folio_batches are populated with the folios to be read.
Finally, the swapin_readahead() procedures call the newly added
process_ra_batch_of_same_type(), which:
1) Reads all the non_zswap_batch folios sequentially by calling
swap_read_folio().
2) Calls swap_read_zswap_batch_unplug() with the zswap_batch, which calls
zswap_finish_load_batch() to decompress each
SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e., up to 8 pages at a time from
a prefetch batch of, say, 32 folios) in parallel with IAA.
Within do_swap_page(), we try to benefit from batch decompressions in both
these scenarios:
1) single-mapped, SWP_SYNCHRONOUS_IO:
We call swapin_readahead() with "single_mapped_path = true". This is
done only in the !zswap_never_enabled() case.
2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
We call swapin_readahead() with "single_mapped_path = false".
This will place folios in the swapcache: a design choice that handles cases
where a folio that is "single-mapped" in process 1 could be prefetched by
process 2, and that maintains stability in highly contended server scenarios.
Checks are added at the end of do_swap_page(), after the folio has
been successfully loaded, to detect whether the single-mapped swapcache
folio is still single-mapped; if so, folio_free_swap() is called on the folio.
Within the swapin_readahead() functions, if single_mapped_path is true and
either the platform does not have IAA, or the platform has IAA but the
user has selected a software compressor for zswap (details of the sysfs
knob follow), readahead/batching are skipped and the folio is loaded using
zswap_load().
A new swap parameter, "singlemapped_ra_enabled" (false by default), is added
for platforms that have IAA and zswap_load_batching_enabled() set to true,
to give the user the option to run experiments with IAA and with
software compressors for zswap (when the swap device is SWP_SYNCHRONOUS_IO):
For IAA:
echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
For software compressors:
echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO" do_swap_page()
path.
Thanks to Ying Huang for the really helpful brainstorming discussions on
the swap_read_folio() plug design.
Suggested-by: Ying Huang <ying.huang@xxxxxxxxx>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx>
---
mm/memory.c | 187 +++++++++++++++++++++++++++++++++++++-----------
mm/shmem.c | 2 +-
mm/swap.h | 12 ++--
mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++----
mm/swapfile.c | 2 +-
5 files changed, 299 insertions(+), 61 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index b5745b9ffdf7..9655b85fc243 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
return 0;
}
+/*
+ * swapin readahead based batching interface for zswap batched loads using IAA:
+ *
+ * Should only be called if the faulting swap entry in do_swap_page
+ * is single-mapped and SWP_SYNCHRONOUS_IO.
+ *
+ * Detect whether the folio is in the swapcache, is still mapped by only
+ * this process, and has no references beyond those obtained within
+ * __read_swap_cache_async() by this process faulting in the single-mapped
+ * swap entry. Additional references could exist if, e.g., another process
+ * simultaneously read ahead this swap entry while this process was
+ * handling the page-fault, and got a pointer to the folio this process
+ * allocated in the swapcache.
+ */
How is this supposed to work for large folios?
+static inline bool should_free_singlemap_swapcache(swp_entry_t entry,
+ struct folio *folio)
+{
+ if (!folio_test_swapcache(folio))
+ return false;
+
+ if (__swap_count(entry) != 0)
+ return false;
+
+ /*
+ * The folio ref count for a single-mapped folio that was allocated
+ * in __read_swap_cache_async() can be at most 3. These are the
+ * incrementors of the folio ref count in __read_swap_cache_async():
+ * folio_alloc_mpol(), add_to_swap_cache(), folio_add_lru().
+ */
+
+ if (folio_ref_count(folio) <= 3)
+ return true;
+
+ return false;
+}
+
static inline bool should_try_to_free_swap(struct folio *folio,
struct vm_area_struct *vma,
unsigned int fault_flags)
@@ -4215,6 +4251,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
swp_entry_t entry;
pte_t pte;
vm_fault_t ret = 0;
+ bool single_mapped_swapcache = false;
void *shadow = NULL;
int nr_pages;
unsigned long page_idx;
@@ -4283,51 +4320,90 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (!folio) {
if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
__swap_count(entry) == 1) {
- /* skip swapcache */
- folio = alloc_swap_folio(vmf);
- if (folio) {
- __folio_set_locked(folio);
- __folio_set_swapbacked(folio);
-
- nr_pages = folio_nr_pages(folio);
- if (folio_test_large(folio))
- entry.val = ALIGN_DOWN(entry.val, nr_pages);
- /*
- * Prevent parallel swapin from proceeding with
- * the cache flag. Otherwise, another thread
- * may finish swapin first, free the entry, and
- * swapout reusing the same entry. It's
- * undetectable as pte_same() returns true due
- * to entry reuse.
- */
- if (swapcache_prepare(entry, nr_pages)) {
+ if (zswap_never_enabled()) {
+ /* skip swapcache */
+ folio = alloc_swap_folio(vmf);
+ if (folio) {
+ __folio_set_locked(folio);
+ __folio_set_swapbacked(folio);
+
+ nr_pages = folio_nr_pages(folio);
+ if (folio_test_large(folio))
+ entry.val = ALIGN_DOWN(entry.val, nr_pages);
/*
- * Relax a bit to prevent rapid
- * repeated page faults.
+ * Prevent parallel swapin from proceeding with
+ * the cache flag. Otherwise, another thread
+ * may finish swapin first, free the entry, and
+ * swapout reusing the same entry. It's
+ * undetectable as pte_same() returns true due
+ * to entry reuse.
*/
- add_wait_queue(&swapcache_wq, &wait);
- schedule_timeout_uninterruptible(1);
- remove_wait_queue(&swapcache_wq, &wait);
- goto out_page;
+ if (swapcache_prepare(entry, nr_pages)) {
+ /*
+ * Relax a bit to prevent rapid
+ * repeated page faults.
+ */
+ add_wait_queue(&swapcache_wq, &wait);
+ schedule_timeout_uninterruptible(1);
+ remove_wait_queue(&swapcache_wq, &wait);
+ goto out_page;
+ }
+ need_clear_cache = true;
+
+ mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
+
+ shadow = get_shadow_from_swap_cache(entry);
+ if (shadow)
+ workingset_refault(folio, shadow);
+
+ folio_add_lru(folio);
+
+ /* To provide entry to swap_read_folio() */
+ folio->swap = entry;
+ swap_read_folio(folio, NULL, NULL, NULL);
+ folio->private = NULL;
+ }
+ } else {
+ /*
+ * zswap is enabled or was enabled at some point.
+ * Don't skip swapcache.
+ *
+ * swapin readahead based batching interface
+ * for zswap batched loads using IAA:
+ *
+ * Readahead is invoked in this path only if
+ * the "singlemapped_ra_enabled" sysfs swap
+ * parameter is set to true. By default,
+ * "singlemapped_ra_enabled" is set to false,
+ * the recommended setting for software compressors.
+ * For IAA, if "singlemapped_ra_enabled" is set
+ * to true, readahead will be deployed in this path
+ * as well.
+ *
+ * For single-mapped pages, the batching interface
+ * calls __read_swap_cache_async() to allocate and
+ * place the faulting page in the swapcache. This is
+ * to handle a scenario where the faulting page in
+ * this process happens to simultaneously be a
+ * readahead page in another process. By placing the
+ * single-mapped faulting page in the swapcache,
+ * we avoid race conditions and duplicate page
+ * allocations under these scenarios.
+ */
+ folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
+ vmf, true);
+ if (!folio) {
+ ret = VM_FAULT_OOM;
+ goto out;
}
- need_clear_cache = true;
-
- mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
-
- shadow = get_shadow_from_swap_cache(entry);
- if (shadow)
- workingset_refault(folio, shadow);
-
- folio_add_lru(folio);
- /* To provide entry to swap_read_folio() */
- folio->swap = entry;
- swap_read_folio(folio, NULL, NULL, NULL);
- folio->private = NULL;
- }
+ single_mapped_swapcache = true;
+ nr_pages = folio_nr_pages(folio);
+ swapcache = folio;
+ } /* swapin with zswap support. */
} else {
folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
- vmf);
+ vmf, false);
swapcache = folio;
I'm sorry, but making this function ever more complicated and ugly is
not going to fly. The zswap special casing is quite ugly here as well.
Is there a way forward by which we can make this code actually readable
and avoid the zswap special casing?
--
Cheers,
David / dhildenb