Hi David,

> -----Original Message-----
> From: David Hildenbrand <david@xxxxxxxxxx>
> Sent: Friday, October 18, 2024 12:27 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>;
> linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx;
> yosryahmed@xxxxxxxxxx; nphamcs@xxxxxxxxx; chengming.zhou@xxxxxxxxx;
> usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx;
> Huang, Ying <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx;
> akpm@xxxxxxxxxxxxxxxxxxxx; hughd@xxxxxxxxxx; willy@xxxxxxxxxxxxx;
> bfoster@xxxxxxxxxx; dchinner@xxxxxxxxxx; chrisl@xxxxxxxxxx
> Cc: Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh
> <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls
> swapin_readahead() zswap load batching interface.
>
> On 18.10.24 08:48, Kanchana P Sridhar wrote:
> > This patch invokes the swapin_readahead() based batching interface to
> > prefetch a batch of 4K folios for zswap load with batch decompressions
> > in parallel using IAA hardware. swapin_readahead() prefetches folios based
> > on vm.page-cluster and the usefulness of prior prefetches to the
> > workload. As folios are created in the swapcache and the readahead code
> > calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch", the
> > respective folio_batches get populated with the folios to be read.
> >
> > Finally, the swapin_readahead() procedures will call the newly added
> > process_ra_batch_of_same_type() which:
> >
> > 1) Reads all the non_zswap_batch folios sequentially by calling
> >    swap_read_folio().
> > 2) Calls swap_read_zswap_batch_unplug() with the zswap_batch which calls
> >    zswap_finish_load_batch() that finally decompresses each
> >    SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. upto 8 pages in a prefetch
> >    batch of say, 32 folios) in parallel with IAA.
> >
> > Within do_swap_page(), we try to benefit from batch decompressions in both
> > these scenarios:
> >
> > 1) single-mapped, SWP_SYNCHRONOUS_IO:
> >    We call swapin_readahead() with "single_mapped_path = true". This is
> >    done only in the !zswap_never_enabled() case.
> > 2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
> >    We call swapin_readahead() with "single_mapped_path = false".
> >
> > This will place folios in the swapcache: a design choice that handles cases
> > where a folio that is "single-mapped" in process 1 could be prefetched in
> > process 2; and handles highly contended server scenarios with stability.
> > There are checks added at the end of do_swap_page(), after the folio has
> > been successfully loaded, to detect if the single-mapped swapcache folio is
> > still single-mapped, and if so, folio_free_swap() is called on the folio.
> >
> > Within the swapin_readahead() functions, if single_mapped_path is true, and
> > either the platform does not have IAA, or, if the platform has IAA and the
> > user selects a software compressor for zswap (details of sysfs knob
> > follow), readahead/batching are skipped and the folio is loaded using
> > zswap_load().
> >
> > A new swap parameter "singlemapped_ra_enabled" (false by default) is added
> > for platforms that have IAA, zswap_load_batching_enabled() is true, and we
> > want to give the user the option to run experiments with IAA and with
> > software compressors for zswap (swap device is SWP_SYNCHRONOUS_IO):
> >
> > For IAA:
> >    echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
> >
> > For software compressors:
> >    echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
> >
> > If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
> > prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO" do_swap_page()
> > path.
> >
> > Thanks Ying Huang for the really helpful brainstorming discussions on the
> > swap_read_folio() plug design.
> >
> > Suggested-by: Ying Huang <ying.huang@xxxxxxxxx>
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx>
> > ---
> >   mm/memory.c     | 187 +++++++++++++++++++++++++++++++++++++-----------
> >   mm/shmem.c      |   2 +-
> >   mm/swap.h       |  12 ++--
> >   mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++----
> >   mm/swapfile.c   |   2 +-
> >   5 files changed, 299 insertions(+), 61 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index b5745b9ffdf7..9655b85fc243 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> >  	return 0;
> >  }
> >
> > +/*
> > + * swapin readahead based batching interface for zswap batched loads using IAA:
> > + *
> > + * Should only be called for and if the faulting swap entry in do_swap_page
> > + * is single-mapped and SWP_SYNCHRONOUS_IO.
> > + *
> > + * Detect if the folio is in the swapcache, is still mapped to only this
> > + * process, and further, there are no additional references to this folio
> > + * (for e.g. if another process simultaneously readahead this swap entry
> > + * while this process was handling the page-fault, and got a pointer to the
> > + * folio allocated by this process in the swapcache), besides the references
> > + * that were obtained within __read_swap_cache_async() by this process that is
> > + * faulting in this single-mapped swap entry.
> > + */
>
> How is this supposed to work for large folios?

Thanks for your code review comments. The main idea behind this patch series
is to work with the existing page-fault granularity of order-0 (4K) folios,
which swapin_readahead() builds upon to prefetch other "useful" order-0
folios. The intent is not to make modifications at page-fault time to
opportunistically synthesize large folios for swapin.

Since __read_swap_cache_async() allocates an order-0 folio, the
implementation of should_free_singlemap_swapcache() in this patch is written
for that case only.

In other words, this is not supposed to work for large folios: it relies on
the existing page-fault behavior and does not change it.

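To make that concrete: the check added at the end of do_swap_page() is,
roughly, along the following lines (a simplified sketch for this discussion,
not the literal hunk; the exact code is in the rest of this patch):

	/*
	 * After the PTE has been installed: if the folio brought in through
	 * the swapcache is still mapped only by this process, release its
	 * swap slot, mirroring what the skip-swapcache path achieves.
	 */
	if (single_mapped_swapcache &&
	    should_free_singlemap_swapcache(entry, folio))
		folio_free_swap(folio);
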
> > +static inline bool should_free_singlemap_swapcache(swp_entry_t entry,
> > +						   struct folio *folio)
> > +{
> > +	if (!folio_test_swapcache(folio))
> > +		return false;
> > +
> > +	if (__swap_count(entry) != 0)
> > +		return false;
> > +
> > +	/*
> > +	 * The folio ref count for a single-mapped folio that was allocated
> > +	 * in __read_swap_cache_async(), can be a maximum of 3. These are the
> > +	 * incrementors of the folio ref count in __read_swap_cache_async():
> > +	 * folio_alloc_mpol(), add_to_swap_cache(), folio_add_lru().
> > +	 */
> > +
> > +	if (folio_ref_count(folio) <= 3)
> > +		return true;
> > +
> > +	return false;
> > +}
> > +
> >  static inline bool should_try_to_free_swap(struct folio *folio,
> >  					   struct vm_area_struct *vma,
> >  					   unsigned int fault_flags)
> > @@ -4215,6 +4251,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  	swp_entry_t entry;
> >  	pte_t pte;
> >  	vm_fault_t ret = 0;
> > +	bool single_mapped_swapcache = false;
> >  	void *shadow = NULL;
> >  	int nr_pages;
> >  	unsigned long page_idx;
> > @@ -4283,51 +4320,90 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  	if (!folio) {
> >  		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >  		    __swap_count(entry) == 1) {
> > -			/* skip swapcache */
> > -			folio = alloc_swap_folio(vmf);
> > -			if (folio) {
> > -				__folio_set_locked(folio);
> > -				__folio_set_swapbacked(folio);
> > -
> > -				nr_pages = folio_nr_pages(folio);
> > -				if (folio_test_large(folio))
> > -					entry.val = ALIGN_DOWN(entry.val, nr_pages);
> > -				/*
> > -				 * Prevent parallel swapin from proceeding with
> > -				 * the cache flag. Otherwise, another thread
> > -				 * may finish swapin first, free the entry, and
> > -				 * swapout reusing the same entry. It's
> > -				 * undetectable as pte_same() returns true due
> > -				 * to entry reuse.
> > -				 */
> > -				if (swapcache_prepare(entry, nr_pages)) {
> > +			if (zswap_never_enabled()) {
> > +				/* skip swapcache */
> > +				folio = alloc_swap_folio(vmf);
> > +				if (folio) {
> > +					__folio_set_locked(folio);
> > +					__folio_set_swapbacked(folio);
> > +
> > +					nr_pages = folio_nr_pages(folio);
> > +					if (folio_test_large(folio))
> > +						entry.val = ALIGN_DOWN(entry.val, nr_pages);
> >  					/*
> > -					 * Relax a bit to prevent rapid
> > -					 * repeated page faults.
> > +					 * Prevent parallel swapin from proceeding with
> > +					 * the cache flag. Otherwise, another thread
> > +					 * may finish swapin first, free the entry, and
> > +					 * swapout reusing the same entry. It's
> > +					 * undetectable as pte_same() returns true due
> > +					 * to entry reuse.
> >  					 */
> > -					add_wait_queue(&swapcache_wq, &wait);
> > -					schedule_timeout_uninterruptible(1);
> > -					remove_wait_queue(&swapcache_wq, &wait);
> > -					goto out_page;
> > +					if (swapcache_prepare(entry, nr_pages)) {
> > +						/*
> > +						 * Relax a bit to prevent rapid
> > +						 * repeated page faults.
> > +						 */
> > +						add_wait_queue(&swapcache_wq, &wait);
> > +						schedule_timeout_uninterruptible(1);
> > +						remove_wait_queue(&swapcache_wq, &wait);
> > +						goto out_page;
> > +					}
> > +					need_clear_cache = true;
> > +
> > +					mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
> > +
> > +					shadow = get_shadow_from_swap_cache(entry);
> > +					if (shadow)
> > +						workingset_refault(folio, shadow);
> > +
> > +					folio_add_lru(folio);
> > +
> > +					/* To provide entry to swap_read_folio() */
> > +					folio->swap = entry;
> > +					swap_read_folio(folio, NULL, NULL, NULL);
> > +					folio->private = NULL;
> > +				}
> > +			} else {
> > +				/*
> > +				 * zswap is enabled or was enabled at some point.
> > +				 * Don't skip swapcache.
> > +				 *
> > +				 * swapin readahead based batching interface
> > +				 * for zswap batched loads using IAA:
> > +				 *
> > +				 * Readahead is invoked in this path only if
> > +				 * the sys swap "singlemapped_ra_enabled" swap
> > +				 * parameter is set to true. By default,
> > +				 * "singlemapped_ra_enabled" is set to false,
> > +				 * the recommended setting for software compressors.
> > +				 * For IAA, if "singlemapped_ra_enabled" is set
> > +				 * to true, readahead will be deployed in this path
> > +				 * as well.
> > +				 *
> > +				 * For single-mapped pages, the batching interface
> > +				 * calls __read_swap_cache_async() to allocate and
> > +				 * place the faulting page in the swapcache. This is
> > +				 * to handle a scenario where the faulting page in
> > +				 * this process happens to simultaneously be a
> > +				 * readahead page in another process. By placing the
> > +				 * single-mapped faulting page in the swapcache,
> > +				 * we avoid race conditions and duplicate page
> > +				 * allocations under these scenarios.
> > +				 */
> > +				folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > +							 vmf, true);
> > +				if (!folio) {
> > +					ret = VM_FAULT_OOM;
> > +					goto out;
> >  				}
> > -				need_clear_cache = true;
> > -
> > -				mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
> > -
> > -				shadow = get_shadow_from_swap_cache(entry);
> > -				if (shadow)
> > -					workingset_refault(folio, shadow);
> > -
> > -				folio_add_lru(folio);
> > -
> > -				/* To provide entry to swap_read_folio() */
> > -				folio->swap = entry;
> > -				swap_read_folio(folio, NULL, NULL, NULL);
> > -				folio->private = NULL;
> > -			}
> > +				single_mapped_swapcache = true;
> > +				nr_pages = folio_nr_pages(folio);
> > +				swapcache = folio;
> > +			} /* swapin with zswap support. */
> >  		} else {
> >  			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > -						vmf);
> > +						vmf, false);
> >  			swapcache = folio;
>
> I'm sorry, but making this function ever more complicated and ugly is
> not going to fly. The zswap special casing is quite ugly here as well.
>
> Is there a way forward that we can make this code actually readable and
> avoid zswap special casing?

Yes, I realize do_swap_page() has become quite cluttered with this change.
I need to think some more about how to restructure it so that the zswap
batching case does not add yet another level of special casing and nesting
here, and I would appreciate any suggestions. I have appended a very rough
sketch of the direction I am considering at the end of this mail.

Thanks,
Kanchana

>
> --
> Cheers,
>
> David / dhildenb

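P.S.: Appending a very rough, untested sketch of the kind of refactoring I
will explore, so the discussion has something concrete: pull the two
SWP_SYNCHRONOUS_IO swapin strategies out of do_swap_page() and behind one
helper, so the fault handler stays linear. The helper names below are
placeholders, not code from this series:

/*
 * Swap in a single-mapped, SWP_SYNCHRONOUS_IO entry. Returns the folio to
 * map, or NULL on allocation failure.
 */
static struct folio *swapin_single_mapped_sync_io(struct vm_fault *vmf,
						  swp_entry_t entry,
						  bool *need_clear_cache,
						  bool *single_mapped_swapcache)
{
	/*
	 * zswap has never been enabled: keep the existing skip-swapcache
	 * swapin. swapin_skip_swapcache() would simply contain the current
	 * alloc_swap_folio()/swapcache_prepare()/swap_read_folio() block,
	 * moved verbatim out of do_swap_page().
	 */
	if (zswap_never_enabled())
		return swapin_skip_swapcache(vmf, entry, need_clear_cache);

	/*
	 * zswap is (or was) enabled: go through the swapcache so that
	 * readahead batching can feed zswap batched loads.
	 */
	*single_mapped_swapcache = true;
	return swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf, true);
}

do_swap_page() would then call this helper in the SWP_SYNCHRONOUS_IO branch
instead of open-coding the nested if/else above, and the error/OOM handling
would stay in the caller.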