Hi Usama, Nhat,

> -----Original Message-----
> From: Nhat Pham <nphamcs@xxxxxxxxx>
> Sent: Friday, October 18, 2024 10:21 AM
> To: Usama Arif <usamaarif642@xxxxxxxxx>
> Cc: David Hildenbrand <david@xxxxxxxxxx>; Sridhar, Kanchana P
> <kanchana.p.sridhar@xxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx; yosryahmed@xxxxxxxxxx;
> chengming.zhou@xxxxxxxxx; ryan.roberts@xxxxxxx; Huang, Ying
> <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
> hughd@xxxxxxxxxx; willy@xxxxxxxxxxxxx; bfoster@xxxxxxxxxx;
> dchinner@xxxxxxxxxx; chrisl@xxxxxxxxxx; Feghali, Wajdi K
> <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.
>
> On Fri, Oct 18, 2024 at 4:04 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >
> >
> > On 18/10/2024 08:26, David Hildenbrand wrote:
> > > On 18.10.24 08:48, Kanchana P Sridhar wrote:
> > >> This patch invokes the swapin_readahead() based batching interface to
> > >> prefetch a batch of 4K folios for zswap load, with batch decompressions
> > >> in parallel using IAA hardware. swapin_readahead() prefetches folios
> > >> based on vm.page-cluster and the usefulness of prior prefetches to the
> > >> workload. As folios are created in the swapcache and the readahead code
> > >> calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch",
> > >> the respective folio_batches get populated with the folios to be read.
> > >>
> > >> Finally, the swapin_readahead() procedures will call the newly added
> > >> process_ra_batch_of_same_type(), which:
> > >>
> > >> 1) Reads all the non_zswap_batch folios sequentially by calling
> > >>    swap_read_folio().
> > >> 2) Calls swap_read_zswap_batch_unplug() with the zswap_batch, which
> > >>    calls zswap_finish_load_batch() to decompress each
> > >>    SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e., up to 8 pages in a
> > >>    prefetch batch of, say, 32 folios) in parallel with IAA.
> > >>
> > >> Within do_swap_page(), we try to benefit from batch decompressions in
> > >> both these scenarios:
> > >>
> > >> 1) Single-mapped, SWP_SYNCHRONOUS_IO folios:
> > >>    We call swapin_readahead() with "single_mapped_path = true". This is
> > >>    done only in the !zswap_never_enabled() case.
> > >> 2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
> > >>    We call swapin_readahead() with "single_mapped_path = false".
> > >>
> > >> This will place folios in the swapcache: a design choice that handles
> > >> cases where a folio that is "single-mapped" in process 1 could be
> > >> prefetched in process 2, and that handles highly contended server
> > >> scenarios with stability. There are checks added at the end of
> > >> do_swap_page(), after the folio has been successfully loaded, to detect
> > >> if the single-mapped swapcache folio is still single-mapped, and if so,
> > >> folio_free_swap() is called on the folio.
> > >>
> > >> Within the swapin_readahead() functions, if single_mapped_path is true
> > >> and either the platform does not have IAA, or the platform has IAA but
> > >> the user selects a software compressor for zswap (details of the sysfs
> > >> knob follow), readahead/batching are skipped and the folio is loaded
> > >> using zswap_load().
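To make steps 1) and 2) above concrete, here is a minimal sketch of what
process_ra_batch_of_same_type() could look like. The function names are the
ones described above; the signatures, the zswap_decomp_batch type, and the
targ_entry/splug parameters are assumptions for illustration only, not the
actual patch code.

/*
 * Illustrative sketch only: the signatures, the zswap_decomp_batch type
 * and the extra parameters are assumptions; the real patch may differ.
 */
static void process_ra_batch_of_same_type(struct zswap_decomp_batch *zswap_batch,
                                          struct folio_batch *non_zswap_batch,
                                          swp_entry_t targ_entry,
                                          struct swap_iocb **splug)
{
        unsigned int i;

        /*
         * Step 1: folios not backed by zswap are read sequentially, as today.
         * Passing NULL batches is an assumed calling convention meaning
         * "read this folio immediately instead of batching it".
         */
        for (i = 0; i < folio_batch_count(non_zswap_batch); i++)
                swap_read_folio(non_zswap_batch->folios[i], splug, NULL, NULL);

        /*
         * Step 2: unplug the zswap batch; zswap_finish_load_batch() then
         * decompresses up to SWAP_CRYPTO_SUB_BATCH_SIZE pages per sub-batch
         * in parallel (e.g. on IAA) before unlocking the folios.
         */
        swap_read_zswap_batch_unplug(zswap_batch, targ_entry);
}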
> > >>
> > >> A new swap parameter "singlemapped_ra_enabled" (false by default) is
> > >> added for platforms that have IAA, zswap_load_batching_enabled() is
> > >> true, and we want to give the user the option to run experiments with
> > >> IAA and with software compressors for zswap (swap device is
> > >> SWP_SYNCHRONOUS_IO):
> > >>
> > >> For IAA:
> > >>   echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
> > >>
> > >> For software compressors:
> > >>   echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
> > >>
> > >> If "singlemapped_ra_enabled" is set to false, swapin_readahead() will
> > >> skip prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO"
> > >> do_swap_page() path.
> > >>
> > >> Thanks Ying Huang for the really helpful brainstorming discussions on
> > >> the swap_read_folio() plug design.
> > >>
> > >> Suggested-by: Ying Huang <ying.huang@xxxxxxxxx>
> > >> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx>
> > >> ---
> > >>   mm/memory.c     | 187 +++++++++++++++++++++++++++++++++++++----------
> > >>   mm/shmem.c      |   2 +-
> > >>   mm/swap.h       |  12 ++--
> > >>   mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++----
> > >>   mm/swapfile.c   |   2 +-
> > >>   5 files changed, 299 insertions(+), 61 deletions(-)
> > >>
> > >> diff --git a/mm/memory.c b/mm/memory.c
> > >> index b5745b9ffdf7..9655b85fc243 100644
> > >> --- a/mm/memory.c
> > >> +++ b/mm/memory.c
> > >> @@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > >>       return 0;
> > >>   }
> > >> +/*
> > >> + * swapin readahead based batching interface for zswap batched loads using IAA:
> > >> + *
> > >> + * Should only be called for and if the faulting swap entry in do_swap_page
> > >> + * is single-mapped and SWP_SYNCHRONOUS_IO.
> > >> + *
> > >> + * Detect if the folio is in the swapcache, is still mapped to only this
> > >> + * process, and further, there are no additional references to this folio
> > >> + * (for e.g. if another process simultaneously readahead this swap entry
> > >> + * while this process was handling the page-fault, and got a pointer to the
> > >> + * folio allocated by this process in the swapcache), besides the references
> > >> + * that were obtained within __read_swap_cache_async() by this process that is
> > >> + * faulting in this single-mapped swap entry.
> > >> + */
> > >
> > > How is this supposed to work for large folios?
> > >
> >
> > Hi,
> >
> > I was looking at zswapin large folio support and have posted an RFC in [1].
> > I got bogged down with some prod stuff, so wasn't able to send it earlier.
> >
> > It looks quite different from, and I think simpler than, this series, so
> > it might be a good comparison.
> >
> > [1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@xxxxxxxxx/
> >
> > Thanks,
> > Usama
>
> I agree.
>
> I think the lower hanging fruit here is to build upon Usama's patch.
> Kanchana, do you think we can just use the new batch decompressing
> infrastructure, and apply it to Usama's large folio zswap loading?
>
> I'm not denying the readahead idea outright, but that seems much more
> complicated. There are questions regarding the benefits of readahead
> when applied to zswap in the first place - IIUC, zram circumvents that
> logic in several cases, and zswap shares many characteristics with zram
> (fast, synchronous compression devices).
>
> So let's reap the low hanging fruits first, get the wins as well as
> stress test the new infrastructure. Then we can discuss the readahead
> idea later?
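For reference, the "still single-mapped" re-check described in the quoted
do_swap_page() comment above could look roughly like the sketch below. Only
folio_free_swap() and the intent come from the patch description; the helper
name, the single_mapped_swapcache flag, and the reference-count threshold are
assumptions for illustration.

/*
 * Illustrative sketch only: helper name and reference-count threshold
 * are assumptions; the real patch may differ.
 */
static inline bool folio_still_single_mapped(struct folio *folio)
{
        /* Still in the swapcache... */
        if (!folio_test_swapcache(folio))
                return false;
        /* ...mapped only by this process... */
        if (folio_mapcount(folio) != 1)
                return false;
        /* ...and no references beyond swapcache + page table + our local ref. */
        return folio_ref_count(folio) <= 3;
}

/*
 * At the end of do_swap_page(), after the folio has been successfully loaded
 * (single_mapped_swapcache is a hypothetical flag set when the single-mapped
 * SWP_SYNCHRONOUS_IO path was taken earlier in the fault):
 */
if (single_mapped_swapcache && folio_still_single_mapped(folio))
        folio_free_swap(folio);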
Thanks Usama for publishing the zswap large folios swapin series, and thanks
Nhat for your suggestions.

Sure, I can look into integrating the new batch decompression infrastructure
with Usama's large folio zswap loading. However, I think we need to get
clarity on a bigger question: does it make sense to swap in large folios?
Some important considerations would be:

1) What are the tradeoffs in the memory footprint cost of swapping in a
   large folio?

2) If we decide to let the user determine this through, say, an option that
   sets the swapin granularity (e.g. no more than 32k at a time), how does
   this constrain compression and zpool storage granularity?

Ultimately, I feel the bigger question is the memory utilization cost of
large folio swapin. The swapin_readahead() based approach tries to use the
prefetch-usefulness characteristics of the workload to improve the efficiency
of swapping in multiple 4K folios, using strategies like parallel
decompression, to strike a balance between memory utilization and efficiency.

Usama, I downloaded your patch series and tried to understand this better,
and wanted to share the data. I ran the kernel compilation "allmodconfig"
with zstd, page-cluster=0, and 16k/32k/64k large folios enabled to "always":

 16k/32k/64k folios: kernel compilation with zstd:
 =================================================

 ------------------------------------------------------------------------------
                              mm-unstable-10-16-2024    + zswap large folios
                                                          swapin series
 ------------------------------------------------------------------------------
 zswap compressor                     zstd                     zstd
 vm.page-cluster                         0                        0
 ------------------------------------------------------------------------------
 real_sec                           772.53                   870.61
 user_sec                        15,780.29                15,836.71
 sys_sec                          5,353.20                 6,185.02
 Max_Res_Set_Size_KB             1,873,348                1,873,004
 ------------------------------------------------------------------------------
 memcg_high                              0                        0
 memcg_swap_fail                         0                        0
 zswpout                        93,811,916              111,663,872
 zswpin                         27,150,029               54,730,678
 pswpout                                64                       59
 pswpin                                 78                       53
 thp_swpout                              0                        0
 thp_swpout_fallback                     0                        0
 16kB-mthp_swpout_fallback               0                        0
 32kB-mthp_swpout_fallback               0                        0
 64kB-mthp_swpout_fallback           5,470                        0
 pgmajfault                     29,019,256               16,615,820
 swap_ra                                 0                        0
 swap_ra_hit                         3,004                    3,614
 ZSWPOUT-16kB                    1,324,160                2,252,747
 ZSWPOUT-32kB                      730,534                1,356,640
 ZSWPOUT-64kB                    3,039,760                3,955,034
 ZSWPIN-16kB                                              1,496,916
 ZSWPIN-32kB                                              1,131,176
 ZSWPIN-64kB                                              1,866,884
 SWPOUT-16kB                             0                        0
 SWPOUT-32kB                             0                        0
 SWPOUT-64kB                             4                        3
 ------------------------------------------------------------------------------

It does appear that there is considerably higher swapout and swapin activity
as a result of swapping in large folios (roughly 2x the zswpin count and ~19%
more zswpout), which ends up impacting performance (~13% longer elapsed time
and ~15% more sys time). I would appreciate thoughts on the usefulness of
swapping in large folios, given the considerations outlined earlier and other
factors.

Thanks,
Kanchana