On Mon, Oct 21, 2024 at 11:44 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
>
>
> On 21/10/2024 06:49, Barry Song wrote:
> > On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>
> >> At time of folio allocation, alloc_swap_folio checks if the entire
> >> folio is in zswap to determine folio order.
> >> During swap_read_folio, zswap_load will check if the entire folio
> >> is in zswap, and if it is, it will iterate through the pages in
> >> the folio and decompress them.
> >> This will mean the benefits of large folios (fewer page faults, batched
> >> PTE and rmap manipulation, reduced lru list, TLB coalescing (for arm64
> >> and amd)) are not lost at swap out when using zswap.
> >> This patch does not add support for hybrid backends (i.e. folios
> >> partly present in swap and zswap).
> >>
> >> Signed-off-by: Usama Arif <usamaarif642@xxxxxxxxx>
> >> ---
> >>  mm/memory.c | 13 +++-------
> >>  mm/zswap.c  | 68 ++++++++++++++++++++++++-----------------------------
> >>  2 files changed, 34 insertions(+), 47 deletions(-)
> >>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index 49d243131169..75f7b9f5fb32 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -4077,13 +4077,14 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
> >>
> >>          /*
> >>           * swap_read_folio() can't handle the case a large folio is hybridly
> >> -         * from different backends. And they are likely corner cases. Similar
> >> -         * things might be added once zswap support large folios.
> >> +         * from different backends. And they are likely corner cases.
> >>           */
> >>          if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
> >>                  return false;
> >>          if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
> >>                  return false;
> >> +        if (unlikely(!zswap_present_test(entry, nr_pages)))
> >> +                return false;
> >>
> >>          return true;
> >>  }
> >> @@ -4130,14 +4131,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >>          if (unlikely(userfaultfd_armed(vma)))
> >>                  goto fallback;
> >>
> >> -        /*
> >> -         * A large swapped out folio could be partially or fully in zswap. We
> >> -         * lack handling for such cases, so fallback to swapping in order-0
> >> -         * folio.
> >> -         */
> >> -        if (!zswap_never_enabled())
> >> -                goto fallback;
> >> -
> >>          entry = pte_to_swp_entry(vmf->orig_pte);
> >>          /*
> >>           * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >> diff --git a/mm/zswap.c b/mm/zswap.c
> >> index 9cc91ae31116..a5aa86c24060 100644
> >> --- a/mm/zswap.c
> >> +++ b/mm/zswap.c
> >> @@ -1624,59 +1624,53 @@ bool zswap_present_test(swp_entry_t swp, int nr_pages)
> >>
> >>  bool zswap_load(struct folio *folio)
> >>  {
> >> +        int nr_pages = folio_nr_pages(folio);
> >>          swp_entry_t swp = folio->swap;
> >> +        unsigned int type = swp_type(swp);
> >>          pgoff_t offset = swp_offset(swp);
> >>          bool swapcache = folio_test_swapcache(folio);
> >> -        struct xarray *tree = swap_zswap_tree(swp);
> >> +        struct xarray *tree;
> >>          struct zswap_entry *entry;
> >> +        int i;
> >>
> >>          VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >>
> >>          if (zswap_never_enabled())
> >>                  return false;
> >>
> >> -        /*
> >> -         * Large folios should not be swapped in while zswap is being used, as
> >> -         * they are not properly handled. Zswap does not properly load large
> >> -         * folios, and a large folio may only be partially in zswap.
> >> -         *
> >> -         * Return true without marking the folio uptodate so that an IO error is
> >> -         * emitted (e.g. do_swap_page() will sigbus).
> >> -         */
> >> -        if (WARN_ON_ONCE(folio_test_large(folio)))
> >> -                return true;
> >> -
> >> -        /*
> >> -         * When reading into the swapcache, invalidate our entry. The
> >> -         * swapcache can be the authoritative owner of the page and
> >> -         * its mappings, and the pressure that results from having two
> >> -         * in-memory copies outweighs any benefits of caching the
> >> -         * compression work.
> >> -         *
> >> -         * (Most swapins go through the swapcache. The notable
> >> -         * exception is the singleton fault on SWP_SYNCHRONOUS_IO
> >> -         * files, which reads into a private page and may free it if
> >> -         * the fault fails. We remain the primary owner of the entry.)
> >> -         */
> >> -        if (swapcache)
> >> -                entry = xa_erase(tree, offset);
> >> -        else
> >> -                entry = xa_load(tree, offset);
> >> -
> >> -        if (!entry)
> >> +        if (!zswap_present_test(folio->swap, nr_pages))
> >>                  return false;
> >
> > Hi Usama,
> >
> > Is there any chance that zswap_present_test() returns true
> > in do_swap_page() but false in zswap_load()? If that's
> > possible, could we be missing something? For example,
> > could it be that zswap has been partially released (with
> > part of it still present) during an mTHP swap-in?
> >
> > If this happens with an mTHP, my understanding is that
> > we shouldn't proceed with reading corrupted data from the
> > disk backend.
> >
>
> If it's not swapcache, the zswap entry is not deleted, so I think
> it should be ok?
>
> We can check over here if the entire folio is in zswap,
> and if not, return true without marking the folio uptodate
> to give an error.

We have swapcache_prepare() called in do_swap_page(), which should have
protected these entries from being partially freed by other processes
(for example, if someone falls back to small folios for the same
address). Therefore, I believe that zswap_present_test() cannot be
false for mTHP in the current case, where only synchronous I/O is
supported.

The below might help detect the bug:

        if (!zswap_present_test(folio->swap, nr_pages)) {
                if (WARN_ON_ONCE(nr_pages > 1))
                        return true;
                return false;
        }

The code seems quite ugly :-) Do we have some way to unify the code
for large and small folios? Not quite sure about shmem though...

> >
> >>
> >> -        zswap_decompress(entry, &folio->page);
> >> +        for (i = 0; i < nr_pages; ++i) {
> >> +                tree = swap_zswap_tree(swp_entry(type, offset + i));
> >> +                /*
> >> +                 * When reading into the swapcache, invalidate our entry. The
> >> +                 * swapcache can be the authoritative owner of the page and
> >> +                 * its mappings, and the pressure that results from having two
> >> +                 * in-memory copies outweighs any benefits of caching the
> >> +                 * compression work.
> >> +                 *
> >> +                 * (Swapins with swap count > 1 go through the swapcache.
> >> +                 * For swap count == 1, the swapcache is skipped and we
> >> +                 * remain the primary owner of the entry.)
> >> +                 */
> >> +                if (swapcache)
> >> +                        entry = xa_erase(tree, offset + i);
> >> +                else
> >> +                        entry = xa_load(tree, offset + i);
> >>
> >> -        count_vm_event(ZSWPIN);
> >> -        if (entry->objcg)
> >> -                count_objcg_events(entry->objcg, ZSWPIN, 1);
> >> +                zswap_decompress(entry, folio_page(folio, i));
> >>
> >> -        if (swapcache) {
> >> -                zswap_entry_free(entry);
> >> -                folio_mark_dirty(folio);
> >> +                if (entry->objcg)
> >> +                        count_objcg_events(entry->objcg, ZSWPIN, 1);
> >> +                if (swapcache)
> >> +                        zswap_entry_free(entry);
> >>          }
> >>
> >> +        count_vm_events(ZSWPIN, nr_pages);
> >> +        if (swapcache)
> >> +                folio_mark_dirty(folio);
> >> +
> >>          folio_mark_uptodate(folio);
> >>          return true;
> >>  }
> >> --
> >> 2.43.5
> >>
> >

Thanks
Barry
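[Editor's note: zswap_present_test() is added by an earlier patch in this
series and is not quoted above. Purely as a hypothetical sketch of the
"entire folio is in zswap" semantics the discussion relies on (not the
series' actual implementation, which may differ), it could be written with
the helpers already visible in the diff, mirroring the per-page tree lookup
in the zswap_load() loop:

/*
 * Hypothetical sketch, for illustration only: report true only if every
 * one of the nr_pages swap slots starting at @swp has a zswap entry,
 * i.e. the entire folio is in zswap.  A folio that is only partially in
 * zswap (the hybrid case the patch explicitly does not support) returns
 * false.
 */
bool zswap_present_test(swp_entry_t swp, int nr_pages)
{
        unsigned int type = swp_type(swp);
        pgoff_t offset = swp_offset(swp);
        int i;

        for (i = 0; i < nr_pages; i++) {
                /* look up the tree per index, as the zswap_load() loop does */
                struct xarray *tree = swap_zswap_tree(swp_entry(type, offset + i));

                if (!xa_load(tree, offset + i))
                        return false;
        }
        return true;
}

Under that reading, the new check at the top of zswap_load() returns false
when the folio is not fully in zswap, and Barry's WARN_ON_ONCE() wrapper
above would turn the nr_pages > 1 case into a visible error rather than a
silent read of corrupted data from the disk backend.]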