On 21/10/2024 11:55, Barry Song wrote:
> On Mon, Oct 21, 2024 at 11:44 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>
>>
>>
>> On 21/10/2024 06:49, Barry Song wrote:
>>> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>>>
>>>> At time of folio allocation, alloc_swap_folio checks if the entire
>>>> folio is in zswap to determine folio order.
>>>> During swap_read_folio, zswap_load will check if the entire folio
>>>> is in zswap, and if it is, it will iterate through the pages in the
>>>> folio and decompress them.
>>>> This means the benefits of large folios (fewer page faults, batched
>>>> PTE and rmap manipulation, reduced lru list, TLB coalescing (for arm64
>>>> and amd)) are not lost at swap out when using zswap.
>>>> This patch does not add support for hybrid backends (i.e. folios
>>>> partly present in swap and zswap).
>>>>
>>>> Signed-off-by: Usama Arif <usamaarif642@xxxxxxxxx>
>>>> ---
>>>>  mm/memory.c | 13 +++-------
>>>>  mm/zswap.c  | 68 ++++++++++++++++++++++++-----------------------
>>>>  2 files changed, 34 insertions(+), 47 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 49d243131169..75f7b9f5fb32 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -4077,13 +4077,14 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
>>>>
>>>>         /*
>>>>          * swap_read_folio() can't handle the case a large folio is hybridly
>>>> -        * from different backends. And they are likely corner cases. Similar
>>>> -        * things might be added once zswap support large folios.
>>>> +        * from different backends. And they are likely corner cases.
>>>>          */
>>>>         if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
>>>>                 return false;
>>>>         if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
>>>>                 return false;
>>>> +       if (unlikely(!zswap_present_test(entry, nr_pages)))
>>>> +               return false;
>>>>
>>>>         return true;
>>>>  }
>>>> @@ -4130,14 +4131,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>>>>         if (unlikely(userfaultfd_armed(vma)))
>>>>                 goto fallback;
>>>>
>>>> -       /*
>>>> -        * A large swapped out folio could be partially or fully in zswap. We
>>>> -        * lack handling for such cases, so fallback to swapping in order-0
>>>> -        * folio.
>>>> -        */
>>>> -       if (!zswap_never_enabled())
>>>> -               goto fallback;
>>>> -
>>>>         entry = pte_to_swp_entry(vmf->orig_pte);
>>>>         /*
>>>>          * Get a list of all the (large) orders below PMD_ORDER that are enabled
>>>> diff --git a/mm/zswap.c b/mm/zswap.c
>>>> index 9cc91ae31116..a5aa86c24060 100644
>>>> --- a/mm/zswap.c
>>>> +++ b/mm/zswap.c
>>>> @@ -1624,59 +1624,53 @@ bool zswap_present_test(swp_entry_t swp, int nr_pages)
>>>>
>>>>  bool zswap_load(struct folio *folio)
>>>>  {
>>>> +       int nr_pages = folio_nr_pages(folio);
>>>>         swp_entry_t swp = folio->swap;
>>>> +       unsigned int type = swp_type(swp);
>>>>         pgoff_t offset = swp_offset(swp);
>>>>         bool swapcache = folio_test_swapcache(folio);
>>>> -       struct xarray *tree = swap_zswap_tree(swp);
>>>> +       struct xarray *tree;
>>>>         struct zswap_entry *entry;
>>>> +       int i;
>>>>
>>>>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
>>>>
>>>>         if (zswap_never_enabled())
>>>>                 return false;
>>>>
>>>> -       /*
>>>> -        * Large folios should not be swapped in while zswap is being used, as
>>>> -        * they are not properly handled. Zswap does not properly load large
>>>> -        * folios, and a large folio may only be partially in zswap.
>>>> -        *
>>>> -        * Return true without marking the folio uptodate so that an IO error is
>>>> -        * emitted (e.g. do_swap_page() will sigbus).
>>>> -        */
>>>> -       if (WARN_ON_ONCE(folio_test_large(folio)))
>>>> -               return true;
>>>> -
>>>> -       /*
>>>> -        * When reading into the swapcache, invalidate our entry. The
>>>> -        * swapcache can be the authoritative owner of the page and
>>>> -        * its mappings, and the pressure that results from having two
>>>> -        * in-memory copies outweighs any benefits of caching the
>>>> -        * compression work.
>>>> -        *
>>>> -        * (Most swapins go through the swapcache. The notable
>>>> -        * exception is the singleton fault on SWP_SYNCHRONOUS_IO
>>>> -        * files, which reads into a private page and may free it if
>>>> -        * the fault fails. We remain the primary owner of the entry.)
>>>> -        */
>>>> -       if (swapcache)
>>>> -               entry = xa_erase(tree, offset);
>>>> -       else
>>>> -               entry = xa_load(tree, offset);
>>>> -
>>>> -       if (!entry)
>>>> +       if (!zswap_present_test(folio->swap, nr_pages))
>>>>                 return false;
>>>
>>> Hi Usama,
>>>
>>> Is there any chance that zswap_present_test() returns true
>>> in do_swap_page() but false in zswap_load()? If that's
>>> possible, could we be missing something? For example,
>>> could it be that zswap has been partially released (with
>>> part of it still present) during an mTHP swap-in?
>>>
>>> If this happens with an mTHP, my understanding is that
>>> we shouldn't proceed with reading corrupted data from the
>>> disk backend.
>>>
>>
>> If it's not swapcache, the zswap entry is not deleted, so I think
>> it should be ok?
>>
>> We can check over here if the entire folio is in zswap,
>> and if not, return true without marking the folio uptodate
>> to give an error.
>
> We have swapcache_prepare() called in do_swap_page(), which should
> have protected these entries from being partially freed by other processes
> (for example, if someone falls back to small folios for the same address).
> Therefore, I believe that zswap_present_test() cannot be false for mTHP in
> the current case where only synchronous I/O is supported.
>
> the below might help detect the bug?
>
> if (!zswap_present_test(folio->swap, nr_pages)) {
>         if (WARN_ON_ONCE(nr_pages > 1))
>                 return true;
>         return false;
> }
>

I think this isn't correct. If nr_pages > 1 and the entire folio is not
in zswap, it should still return false, so we would need to check the
whole folio if we want to warn. But if we are sure the code is ok, it
is an unnecessary check.
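If we did want the warning, it would have to distinguish "fully on the
swap backend" from "partially in zswap". A rough, untested sketch of
that idea (zswap_present_count() is a hypothetical helper that would
return how many of the nr_pages entries are in the zswap trees; no
such helper exists in this series):

        /* zswap_present_count() is hypothetical, see above. */
        int present = zswap_present_count(folio->swap, nr_pages);

        /* None of the entries are in zswap: read from the backend. */
        if (present == 0)
                return false;

        /*
         * Hybrid folio, partially in zswap: unsupported. Return true
         * without marking the folio uptodate so the fault sees an IO
         * error.
         */
        if (WARN_ON_ONCE(present != nr_pages))
                return true;
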
> the code seems quite ugly :-) do we have some way to unify the code
> for large and small folios?
>
> not quite sure about shmem though....
>

If it's shmem and the swap count goes to 1, I think it's still ok,
because the folio will then be obtained from swap_cache_get_folio()
if it is already in the swapcache.
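For reference, the ordering being relied on is roughly the following
(a simplified, untested sketch of the do_swap_page() flow, not the
exact code):

        /* The swapcache is always consulted first. */
        folio = swap_cache_get_folio(entry, vma, vmf->address);
        if (!folio && data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
            __swap_count(entry) == 1) {
                /*
                 * Singleton fault: skip the swapcache and read into
                 * a private (possibly large) folio instead.
                 */
                folio = alloc_swap_folio(vmf);
        }

so if an earlier swapin already put the folio in the swapcache,
swap_cache_get_folio() finds it before the skip-swapcache path is
even considered.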
>>
>>
>>>>
>>>> -       zswap_decompress(entry, &folio->page);
>>>> +       for (i = 0; i < nr_pages; ++i) {
>>>> +               tree = swap_zswap_tree(swp_entry(type, offset + i));
>>>> +               /*
>>>> +                * When reading into the swapcache, invalidate our entry. The
>>>> +                * swapcache can be the authoritative owner of the page and
>>>> +                * its mappings, and the pressure that results from having two
>>>> +                * in-memory copies outweighs any benefits of caching the
>>>> +                * compression work.
>>>> +                *
>>>> +                * (Swapins with swap count > 1 go through the swapcache.
>>>> +                * For swap count == 1, the swapcache is skipped and we
>>>> +                * remain the primary owner of the entry.)
>>>> +                */
>>>> +               if (swapcache)
>>>> +                       entry = xa_erase(tree, offset + i);
>>>> +               else
>>>> +                       entry = xa_load(tree, offset + i);
>>>>
>>>> -       count_vm_event(ZSWPIN);
>>>> -       if (entry->objcg)
>>>> -               count_objcg_events(entry->objcg, ZSWPIN, 1);
>>>> +               zswap_decompress(entry, folio_page(folio, i));
>>>>
>>>> -       if (swapcache) {
>>>> -               zswap_entry_free(entry);
>>>> -               folio_mark_dirty(folio);
>>>> +               if (entry->objcg)
>>>> +                       count_objcg_events(entry->objcg, ZSWPIN, 1);
>>>> +               if (swapcache)
>>>> +                       zswap_entry_free(entry);
>>>>         }
>>>>
>>>> +       count_vm_events(ZSWPIN, nr_pages);
>>>> +       if (swapcache)
>>>> +               folio_mark_dirty(folio);
>>>> +
>>>>         folio_mark_uptodate(folio);
>>>>         return true;
>>>>  }
>>>> --
>>>> 2.43.5
>>>>
>>>
>
> Thanks
> barry