On Fri, Oct 18, 2024 at 4:04 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
>
> On 18/10/2024 08:26, David Hildenbrand wrote:
> > On 18.10.24 08:48, Kanchana P Sridhar wrote:
> >> This patch invokes the swapin_readahead() based batching interface to
> >> prefetch a batch of 4K folios for zswap load, with batch decompressions
> >> done in parallel using IAA hardware. swapin_readahead() prefetches folios
> >> based on vm.page-cluster and the usefulness of prior prefetches to the
> >> workload. As folios are created in the swapcache and the readahead code
> >> calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch", the
> >> respective folio_batches get populated with the folios to be read.
> >>
> >> Finally, the swapin_readahead() procedures call the newly added
> >> process_ra_batch_of_same_type(), which:
> >>
> >> 1) Reads all the non_zswap_batch folios sequentially by calling
> >>    swap_read_folio().
> >> 2) Calls swap_read_zswap_batch_unplug() with the zswap_batch, which calls
> >>    zswap_finish_load_batch() to finally decompress each
> >>    SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. up to 8 pages in a prefetch
> >>    batch of, say, 32 folios) in parallel with IAA.
> >>
> >> Within do_swap_page(), we try to benefit from batch decompressions in both
> >> of these scenarios:
> >>
> >> 1) Single-mapped, SWP_SYNCHRONOUS_IO:
> >>    We call swapin_readahead() with "single_mapped_path = true". This is
> >>    done only in the !zswap_never_enabled() case.
> >> 2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
> >>    We call swapin_readahead() with "single_mapped_path = false".
> >>
> >> This will place folios in the swapcache: a design choice that handles cases
> >> where a folio that is "single-mapped" in process 1 could be prefetched in
> >> process 2, and that handles highly contended server scenarios with
> >> stability. There are checks added at the end of do_swap_page(), after the
> >> folio has been successfully loaded, to detect if the single-mapped
> >> swapcache folio is still single-mapped, and if so, folio_free_swap() is
> >> called on the folio.
> >>
> >> Within the swapin_readahead() functions, if single_mapped_path is true, and
> >> either the platform does not have IAA, or the platform has IAA and the
> >> user selects a software compressor for zswap (details of the sysfs knob
> >> follow), readahead/batching are skipped and the folio is loaded using
> >> zswap_load().
> >>
> >> A new swap parameter "singlemapped_ra_enabled" (false by default) is added
> >> for platforms that have IAA, where zswap_load_batching_enabled() is true,
> >> and we want to give the user the option to run experiments with IAA and
> >> with software compressors for zswap (the swap device is SWP_SYNCHRONOUS_IO):
> >>
> >> For IAA:
> >>   echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
> >>
> >> For software compressors:
> >>   echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
> >>
> >> If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
> >> prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO" do_swap_page()
> >> path.
> >>
> >> Thanks to Ying Huang for the really helpful brainstorming discussions on
> >> the swap_read_folio() plug design.
> >>
> >> Suggested-by: Ying Huang <ying.huang@xxxxxxxxx>
> >> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx>
> >> ---
> >>  mm/memory.c     | 187 +++++++++++++++++++++++++++++++++++++-----------
> >>  mm/shmem.c      |   2 +-
> >>  mm/swap.h       |  12 ++--
> >>  mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++----
> >>  mm/swapfile.c   |   2 +-
> >>  5 files changed, 299 insertions(+), 61 deletions(-)
> >>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index b5745b9ffdf7..9655b85fc243 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> >>   	return 0;
> >>   }
> >> +/*
> >> + * swapin readahead based batching interface for zswap batched loads using IAA:
> >> + *
> >> + * Should only be called for and if the faulting swap entry in do_swap_page
> >> + * is single-mapped and SWP_SYNCHRONOUS_IO.
> >> + *
> >> + * Detect if the folio is in the swapcache, is still mapped to only this
> >> + * process, and further, there are no additional references to this folio
> >> + * (for e.g. if another process simultaneously readahead this swap entry
> >> + * while this process was handling the page-fault, and got a pointer to the
> >> + * folio allocated by this process in the swapcache), besides the references
> >> + * that were obtained within __read_swap_cache_async() by this process that is
> >> + * faulting in this single-mapped swap entry.
> >> + */
> >
> > How is this supposed to work for large folios?
> >
>
> Hi,
>
> I was looking at zswapin large folio support and have posted an RFC in [1].
> I got bogged down with some prod stuff, so wasn't able to send it earlier.
>
> It looks quite different from, and I think simpler than, this series, so it
> might be a good comparison.
>
> [1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@xxxxxxxxx/
>
> Thanks,
> Usama

I agree. I think the lower-hanging fruit here is to build upon Usama's
patch. Kanchana, do you think we can just take the new batch decompression
infrastructure and apply it to Usama's large folio zswap loading?

I'm not rejecting the readahead idea outright, but it seems much more
complicated. There are also open questions about the benefits of readahead
when applied to zswap in the first place - IIUC, zram circumvents that
logic in several cases, and zswap shares many characteristics with zram
(both are fast, synchronous compression devices).

So let's reap the low-hanging fruit first, get the wins, and stress-test
the new infrastructure. Then we can discuss the readahead idea later?
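To make the combination concrete, here is a rough, untested sketch of what
I have in mind. Treat all the names as placeholders: zswap_decompress_batch()
stands in for whatever batch API Kanchana's series ends up exposing, and I'm
only assuming SWAP_CRYPTO_SUB_BATCH_SIZE from the cover letter:

static bool zswap_load_folio_batched(struct folio *folio)
{
	/*
	 * Sketch only: load a large folio from zswap by handing its
	 * compressed subpages to the decompression engine in sub-batches
	 * of up to SWAP_CRYPTO_SUB_BATCH_SIZE, so IAA can decompress
	 * each sub-batch in parallel.
	 */
	long nr_pages = folio_nr_pages(folio);
	long i, nr;

	for (i = 0; i < nr_pages; i += SWAP_CRYPTO_SUB_BATCH_SIZE) {
		nr = min_t(long, nr_pages - i, SWAP_CRYPTO_SUB_BATCH_SIZE);

		/*
		 * Placeholder for the batch API: submit nr decompressions
		 * for subpages [i, i + nr) and wait for the sub-batch to
		 * complete.
		 */
		if (!zswap_decompress_batch(folio, i, nr))
			return false;
	}

	return true;
}

The nice property here is that no readahead or swapcache tricks are needed:
a large folio already hands us multiple pages to decompress at once, so the
batching win comes for free on the zswapin path.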