RE: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.

"Sridhar, Kanchana P" <kanchana.p.sridhar@xxxxxxxxx> · Fri, 18 Oct 2024 18:09:04 +0000

Hi David,

> -----Original Message-----
> From: David Hildenbrand <david@xxxxxxxxxx>
> Sent: Friday, October 18, 2024 12:27 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>; linux-
> kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx;
> yosryahmed@xxxxxxxxxx; nphamcs@xxxxxxxxx;
> chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx;
> ryan.roberts@xxxxxxx; Huang, Ying <ying.huang@xxxxxxxxx>;
> 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; hughd@xxxxxxxxxx;
> willy@xxxxxxxxxxxxx; bfoster@xxxxxxxxxx; dchinner@xxxxxxxxxx;
> chrisl@xxxxxxxxxx
> Cc: Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh
> <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls
> swapin_readahead() zswap load batching interface.
> 
> On 18.10.24 08:48, Kanchana P Sridhar wrote:
> > This patch invokes the swapin_readahead() based batching interface to
> > prefetch a batch of 4K folios for zswap load with batch decompressions
> > in parallel using IAA hardware. swapin_readahead() prefetches folios based
> > on vm.page-cluster and the usefulness of prior prefetches to the
> > workload. As folios are created in the swapcache and the readahead code
> > calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch",
> the
> > respective folio_batches get populated with the folios to be read.
> >
> > Finally, the swapin_readahead() procedures will call the newly added
> > process_ra_batch_of_same_type() which:
> >
> >   1) Reads all the non_zswap_batch folios sequentially by calling
> >      swap_read_folio().
> >   2) Calls swap_read_zswap_batch_unplug() with the zswap_batch which
> calls
> >      zswap_finish_load_batch() that finally decompresses each
> >      SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. upto 8 pages in a
> prefetch
> >      batch of say, 32 folios) in parallel with IAA.
> >
> > Within do_swap_page(), we try to benefit from batch decompressions in
> both
> > these scenarios:
> >
> >   1) single-mapped, SWP_SYNCHRONOUS_IO:
> >        We call swapin_readahead() with "single_mapped_path = true". This is
> >        done only in the !zswap_never_enabled() case.
> >   2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
> >        We call swapin_readahead() with "single_mapped_path = false".
> >
> > This will place folios in the swapcache: a design choice that handles cases
> > where a folio that is "single-mapped" in process 1 could be prefetched in
> > process 2; and handles highly contended server scenarios with stability.
> > There are checks added at the end of do_swap_page(), after the folio has
> > been successfully loaded, to detect if the single-mapped swapcache folio is
> > still single-mapped, and if so, folio_free_swap() is called on the folio.
> >
> > Within the swapin_readahead() functions, if single_mapped_path is true,
> and
> > either the platform does not have IAA, or, if the platform has IAA and the
> > user selects a software compressor for zswap (details of sysfs knob
> > follow), readahead/batching are skipped and the folio is loaded using
> > zswap_load().
> >
> > A new swap parameter "singlemapped_ra_enabled" (false by default) is
> added
> > for platforms that have IAA, zswap_load_batching_enabled() is true, and we
> > want to give the user the option to run experiments with IAA and with
> > software compressors for zswap (swap device is SWP_SYNCHRONOUS_IO):
> >
> > For IAA:
> >   echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
> >
> > For software compressors:
> >   echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
> >
> > If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
> > prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO"
> do_swap_page()
> > path.
> >
> > Thanks Ying Huang for the really helpful brainstorming discussions on the
> > swap_read_folio() plug design.
> >
> > Suggested-by: Ying Huang <ying.huang@xxxxxxxxx>
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx>
> > ---
> >   mm/memory.c     | 187 +++++++++++++++++++++++++++++++++++++------
> -----
> >   mm/shmem.c      |   2 +-
> >   mm/swap.h       |  12 ++--
> >   mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++----
> >   mm/swapfile.c   |   2 +-
> >   5 files changed, 299 insertions(+), 61 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index b5745b9ffdf7..9655b85fc243 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3924,6 +3924,42 @@ static vm_fault_t
> remove_device_exclusive_entry(struct vm_fault *vmf)
> >   	return 0;
> >   }
> >
> > +/*
> > + * swapin readahead based batching interface for zswap batched loads
> using IAA:
> > + *
> > + * Should only be called for and if the faulting swap entry in do_swap_page
> > + * is single-mapped and SWP_SYNCHRONOUS_IO.
> > + *
> > + * Detect if the folio is in the swapcache, is still mapped to only this
> > + * process, and further, there are no additional references to this folio
> > + * (for e.g. if another process simultaneously readahead this swap entry
> > + * while this process was handling the page-fault, and got a pointer to the
> > + * folio allocated by this process in the swapcache), besides the references
> > + * that were obtained within __read_swap_cache_async() by this process
> that is
> > + * faulting in this single-mapped swap entry.
> > + */
> 
> How is this supposed to work for large folios?

Thanks for your code review comments. The main idea behind this
patch-series is to work with the existing kernel page-fault granularity of 4K
folios, that swapin_readahead() builds upon to prefetch other "useful"
4K folios. The intent is to not try to make modifications at page-fault time
to opportunistically synthesize large folios for swapin.

As we know, __read_swap_cache_async() allocates an order-0 folio, which
explains the implementation of should_free_singlemap_swapcache() in this
patch. IOW, this is not supposed to work for large folios based on the existing
page-fault behavior and without making any modifications to that.

> 
> > +static inline bool should_free_singlemap_swapcache(swp_entry_t entry,
> > +						   struct folio *folio)
> > +{
> > +	if (!folio_test_swapcache(folio))
> > +		return false;
> > +
> > +	if (__swap_count(entry) != 0)
> > +		return false;
> > +
> > +	/*
> > +	 * The folio ref count for a single-mapped folio that was allocated
> > +	 * in __read_swap_cache_async(), can be a maximum of 3. These are
> the
> > +	 * incrementors of the folio ref count in __read_swap_cache_async():
> > +	 * folio_alloc_mpol(), add_to_swap_cache(), folio_add_lru().
> > +	 */
> > +
> > +	if (folio_ref_count(folio) <= 3)
> > +		return true;
> > +
> > +	return false;
> > +}
> > +
> >   static inline bool should_try_to_free_swap(struct folio *folio,
> >   					   struct vm_area_struct *vma,
> >   					   unsigned int fault_flags)
> > @@ -4215,6 +4251,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >   	swp_entry_t entry;
> >   	pte_t pte;
> >   	vm_fault_t ret = 0;
> > +	bool single_mapped_swapcache = false;
> >   	void *shadow = NULL;
> >   	int nr_pages;
> >   	unsigned long page_idx;
> > @@ -4283,51 +4320,90 @@ vm_fault_t do_swap_page(struct vm_fault
> *vmf)
> >   	if (!folio) {
> >   		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >   		    __swap_count(entry) == 1) {
> > -			/* skip swapcache */
> > -			folio = alloc_swap_folio(vmf);
> > -			if (folio) {
> > -				__folio_set_locked(folio);
> > -				__folio_set_swapbacked(folio);
> > -
> > -				nr_pages = folio_nr_pages(folio);
> > -				if (folio_test_large(folio))
> > -					entry.val = ALIGN_DOWN(entry.val,
> nr_pages);
> > -				/*
> > -				 * Prevent parallel swapin from proceeding
> with
> > -				 * the cache flag. Otherwise, another thread
> > -				 * may finish swapin first, free the entry, and
> > -				 * swapout reusing the same entry. It's
> > -				 * undetectable as pte_same() returns true
> due
> > -				 * to entry reuse.
> > -				 */
> > -				if (swapcache_prepare(entry, nr_pages)) {
> > +			if (zswap_never_enabled()) {
> > +				/* skip swapcache */
> > +				folio = alloc_swap_folio(vmf);
> > +				if (folio) {
> > +					__folio_set_locked(folio);
> > +					__folio_set_swapbacked(folio);
> > +
> > +					nr_pages = folio_nr_pages(folio);
> > +					if (folio_test_large(folio))
> > +						entry.val =
> ALIGN_DOWN(entry.val, nr_pages);
> >   					/*
> > -					 * Relax a bit to prevent rapid
> > -					 * repeated page faults.
> > +					 * Prevent parallel swapin from
> proceeding with
> > +					 * the cache flag. Otherwise, another
> thread
> > +					 * may finish swapin first, free the
> entry, and
> > +					 * swapout reusing the same entry.
> It's
> > +					 * undetectable as pte_same()
> returns true due
> > +					 * to entry reuse.
> >   					 */
> > -					add_wait_queue(&swapcache_wq,
> &wait);
> > -
> 	schedule_timeout_uninterruptible(1);
> > -
> 	remove_wait_queue(&swapcache_wq, &wait);
> > -					goto out_page;
> > +					if (swapcache_prepare(entry,
> nr_pages)) {
> > +						/*
> > +						 * Relax a bit to prevent rapid
> > +						 * repeated page faults.
> > +						 */
> > +
> 	add_wait_queue(&swapcache_wq, &wait);
> > +
> 	schedule_timeout_uninterruptible(1);
> > +
> 	remove_wait_queue(&swapcache_wq, &wait);
> > +						goto out_page;
> > +					}
> > +					need_clear_cache = true;
> > +
> > +
> 	mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
> > +
> > +					shadow =
> get_shadow_from_swap_cache(entry);
> > +					if (shadow)
> > +						workingset_refault(folio,
> shadow);
> > +
> > +					folio_add_lru(folio);
> > +
> > +					/* To provide entry to
> swap_read_folio() */
> > +					folio->swap = entry;
> > +					swap_read_folio(folio, NULL, NULL,
> NULL);
> > +					folio->private = NULL;
> > +				}
> > +			} else {
> > +				/*
> > +				 * zswap is enabled or was enabled at some
> point.
> > +				 * Don't skip swapcache.
> > +				 *
> > +				 * swapin readahead based batching
> interface
> > +				 * for zswap batched loads using IAA:
> > +				 *
> > +				 * Readahead is invoked in this path only if
> > +				 * the sys swap "singlemapped_ra_enabled"
> swap
> > +				 * parameter is set to true. By default,
> > +				 * "singlemapped_ra_enabled" is set to false,
> > +				 * the recommended setting for software
> compressors.
> > +				 * For IAA, if "singlemapped_ra_enabled" is
> set
> > +				 * to true, readahead will be deployed in this
> path
> > +				 * as well.
> > +				 *
> > +				 * For single-mapped pages, the batching
> interface
> > +				 * calls __read_swap_cache_async() to
> allocate and
> > +				 * place the faulting page in the swapcache.
> This is
> > +				 * to handle a scenario where the faulting
> page in
> > +				 * this process happens to simultaneously be
> a
> > +				 * readahead page in another process. By
> placing the
> > +				 * single-mapped faulting page in the
> swapcache,
> > +				 * we avoid race conditions and duplicate
> page
> > +				 * allocations under these scenarios.
> > +				 */
> > +				folio = swapin_readahead(entry,
> GFP_HIGHUSER_MOVABLE,
> > +							 vmf, true);
> > +				if (!folio) {
> > +					ret = VM_FAULT_OOM;
> > +					goto out;
> >   				}
> > -				need_clear_cache = true;
> > -
> > -				mem_cgroup_swapin_uncharge_swap(entry,
> nr_pages);
> > -
> > -				shadow =
> get_shadow_from_swap_cache(entry);
> > -				if (shadow)
> > -					workingset_refault(folio, shadow);
> > -
> > -				folio_add_lru(folio);
> >
> > -				/* To provide entry to swap_read_folio() */
> > -				folio->swap = entry;
> > -				swap_read_folio(folio, NULL, NULL, NULL);
> > -				folio->private = NULL;
> > -			}
> > +				single_mapped_swapcache = true;
> > +				nr_pages = folio_nr_pages(folio);
> > +				swapcache = folio;
> > +			} /* swapin with zswap support. */
> >   		} else {
> >   			folio = swapin_readahead(entry,
> GFP_HIGHUSER_MOVABLE,
> > -						vmf);
> > +						 vmf, false);
> >   			swapcache = folio;
> 
> I'm sorry, but making this function ever more complicated and ugly is
> not going to fly. The zswap special casing is quite ugly here as well.
> 
> Is there a way forward that we can make this code actually readable and
> avoid zswap special casing?

Yes, I realize this is now quite cluttered. I need to think some more about
how to make this more readable, and would appreciate suggestions
towards this.

Thanks,
Kanchana

> 
> --
> Cheers,
> 
> David / dhildenb