On 21/10/2024 21:28, Barry Song wrote:
> On Tue, Oct 22, 2024 at 1:21 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>
>>
>>
>> On 21/10/2024 11:55, Barry Song wrote:
>>> On Mon, Oct 21, 2024 at 11:44 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>>>
>>>>
>>>>
>>>> On 21/10/2024 06:49, Barry Song wrote:
>>>>> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>>>>>>
>>>>>> At the time of folio allocation, alloc_swap_folio checks if the entire
>>>>>> folio is in zswap to determine the folio order.
>>>>>> During swap_read_folio, zswap_load will check if the entire folio
>>>>>> is in zswap, and if it is, it will iterate through the pages in the
>>>>>> folio and decompress them.
>>>>>> This means the benefits of large folios (fewer page faults, batched
>>>>>> PTE and rmap manipulation, reduced lru list, TLB coalescing (for arm64
>>>>>> and amd)) are not lost at swap out when using zswap.
>>>>>> This patch does not add support for hybrid backends (i.e. folios
>>>>>> partly present in swap and zswap).
>>>>>>
>>>>>> Signed-off-by: Usama Arif <usamaarif642@xxxxxxxxx>
>>>>>> ---
>>>>>> mm/memory.c | 13 +++-------
>>>>>> mm/zswap.c | 68 ++++++++++++++++++++++++-----------------------------
>>>>>> 2 files changed, 34 insertions(+), 47 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index 49d243131169..75f7b9f5fb32 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -4077,13 +4077,14 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
>>>>>>
>>>>>> /*
>>>>>> * swap_read_folio() can't handle the case a large folio is hybridly
>>>>>> - * from different backends. And they are likely corner cases. Similar
>>>>>> - * things might be added once zswap support large folios.
>>>>>> + * from different backends. And they are likely corner cases.
>>>>>> */
>>>>>> if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
>>>>>> return false;
>>>>>> if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
>>>>>> return false;
>>>>>> + if (unlikely(!zswap_present_test(entry, nr_pages)))
>>>>>> + return false;
>>>>>>
>>>>>> return true;
>>>>>> }
>>>>>> @@ -4130,14 +4131,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>>>>>> if (unlikely(userfaultfd_armed(vma)))
>>>>>> goto fallback;
>>>>>>
>>>>>> - /*
>>>>>> - * A large swapped out folio could be partially or fully in zswap. We
>>>>>> - * lack handling for such cases, so fallback to swapping in order-0
>>>>>> - * folio.
>>>>>> - */
>>>>>> - if (!zswap_never_enabled())
>>>>>> - goto fallback;
>>>>>> -
>>>>>> entry = pte_to_swp_entry(vmf->orig_pte);
>>>>>> /*
>>>>>> * Get a list of all the (large) orders below PMD_ORDER that are enabled
>>>>>> diff --git a/mm/zswap.c b/mm/zswap.c
>>>>>> index 9cc91ae31116..a5aa86c24060 100644
>>>>>> --- a/mm/zswap.c
>>>>>> +++ b/mm/zswap.c
>>>>>> @@ -1624,59 +1624,53 @@ bool zswap_present_test(swp_entry_t swp, int nr_pages)
>>>>>>
>>>>>> bool zswap_load(struct folio *folio)
>>>>>> {
>>>>>> + int nr_pages = folio_nr_pages(folio);
>>>>>> swp_entry_t swp = folio->swap;
>>>>>> + unsigned int type = swp_type(swp);
>>>>>> pgoff_t offset = swp_offset(swp);
>>>>>> bool swapcache = folio_test_swapcache(folio);
>>>>>> - struct xarray *tree = swap_zswap_tree(swp);
>>>>>> + struct xarray *tree;
>>>>>> struct zswap_entry *entry;
>>>>>> + int i;
>>>>>>
>>>>>> VM_WARN_ON_ONCE(!folio_test_locked(folio));
>>>>>>
>>>>>> if (zswap_never_enabled())
>>>>>> return false;
>>>>>>
>>>>>> - /*
>>>>>> - * Large folios should not be swapped in while zswap is being used, as
>>>>>> - * they are not properly handled. Zswap does not properly load large
>>>>>> - * folios, and a large folio may only be partially in zswap.
>>>>>> - *
>>>>>> - * Return true without marking the folio uptodate so that an IO error is
>>>>>> - * emitted (e.g. do_swap_page() will sigbus).
>>>>>> - */
>>>>>> - if (WARN_ON_ONCE(folio_test_large(folio)))
>>>>>> - return true;
>>>>>> -
>>>>>> - /*
>>>>>> - * When reading into the swapcache, invalidate our entry. The
>>>>>> - * swapcache can be the authoritative owner of the page and
>>>>>> - * its mappings, and the pressure that results from having two
>>>>>> - * in-memory copies outweighs any benefits of caching the
>>>>>> - * compression work.
>>>>>> - *
>>>>>> - * (Most swapins go through the swapcache. The notable
>>>>>> - * exception is the singleton fault on SWP_SYNCHRONOUS_IO
>>>>>> - * files, which reads into a private page and may free it if
>>>>>> - * the fault fails. We remain the primary owner of the entry.)
>>>>>> - */
>>>>>> - if (swapcache)
>>>>>> - entry = xa_erase(tree, offset);
>>>>>> - else
>>>>>> - entry = xa_load(tree, offset);
>>>>>> -
>>>>>> - if (!entry)
>>>>>> + if (!zswap_present_test(folio->swap, nr_pages))
>>>>>> return false;
>>>>>
>>>>> Hi Usama,
>>>>>
>>>>> Is there any chance that zswap_present_test() returns true
>>>>> in do_swap_page() but false in zswap_load()? If that’s
>>>>> possible, could we be missing something? For example,
>>>>> could it be that zswap has been partially released (with
>>>>> part of it still present) during an mTHP swap-in?
>>>>>
>>>>> If this happens with an mTHP, my understanding is that
>>>>> we shouldn't proceed with reading corrupted data from the
>>>>> disk backend.
>>>>>
>>>>
>>>> If its not swapcache, the zswap entry is not deleted so I think
>>>> it should be ok?
>>>>
>>>> We can check over here if the entire folio is in zswap,
>>>> and if not, return true without marking the folio uptodate
>>>> to give an error.
>>>
>>> We have swapcache_prepare() called in do_swap_page(), which should
>>> have protected these entries from being partially freed by other processes
>>> (for example, if someone falls back to small folios for the same address).
>>> Therefore, I believe that zswap_present_test() cannot be false for mTHP in
>>> the current case where only synchronous I/O is supported.
>>>
>>> the below might help detect the bug?
>>>
>>> if (!zswap_present_test(folio->swap, nr_pages)) {
>>>         if (WARN_ON_ONCE(nr_pages > 1))
>>>                 return true;
>>>         return false;
>>> }
>>>
>>
>> I think this isn't correct. If nr_pages > 1 and the entire folio is not in
>> zswap, it should still return false. So we would need to check the whole
>> folio if we want to warn. But I think if we are sure the code is ok, it is
>> an unnecessary check.
>
> my point is that zswap_present_test() can't differentiate
> 1. the *whole* folio is not in zswap
> 2. the folio is *partially* not in zswap
>
> in case 2, returning false is wrong.
>

Agreed!

> And when nr_pages > 1, we have already confirmed earlier in
> do_swap_page() that zswap_present_test() is true. At this point,
> it must always be true; if it's false, it indicates a bug.
>

Yes agreed! I was thinking from just the zswap_load perspective, irrespective
of who calls it. If someone adds large folio support to swapin_readahead,
then I think the above WARN might be an issue. But just with this patch
series, doing what you suggested is correct. I will add it in the next
revision. We can deal with it once swapin with swap count > 1 starts
supporting large folios.

>>
>>> the code seems quite ugly :-) do we have some way to unify the code
>>> for large and small folios?
>>>
>>> not quite sure about shmem though....
>>
>> If it's shmem, and the swap_count goes to 1, I think it's still ok, because
>> then the folio will be gotten from swap_cache_get_folio if it has already
>> been in swapcache.
>>>>
>>>>
>>>>>>
>>>>>> - zswap_decompress(entry, &folio->page);
>>>>>> + for (i = 0; i < nr_pages; ++i) {
>>>>>> + tree = swap_zswap_tree(swp_entry(type, offset + i));
>>>>>> + /*
>>>>>> + * When reading into the swapcache, invalidate our entry. The
>>>>>> + * swapcache can be the authoritative owner of the page and
>>>>>> + * its mappings, and the pressure that results from having two
>>>>>> + * in-memory copies outweighs any benefits of caching the
>>>>>> + * compression work.
>>>>>> + *
>>>>>> + * (Swapins with swap count > 1 go through the swapcache.
>>>>>> + * For swap count == 1, the swapcache is skipped and we
>>>>>> + * remain the primary owner of the entry.)
>>>>>> + */
>>>>>> + if (swapcache)
>>>>>> + entry = xa_erase(tree, offset + i);
>>>>>> + else
>>>>>> + entry = xa_load(tree, offset + i);
>>>>>>
>>>>>> - count_vm_event(ZSWPIN);
>>>>>> - if (entry->objcg)
>>>>>> - count_objcg_events(entry->objcg, ZSWPIN, 1);
>>>>>> + zswap_decompress(entry, folio_page(folio, i));
>>>>>>
>>>>>> - if (swapcache) {
>>>>>> - zswap_entry_free(entry);
>>>>>> - folio_mark_dirty(folio);
>>>>>> + if (entry->objcg)
>>>>>> + count_objcg_events(entry->objcg, ZSWPIN, 1);
>>>>>> + if (swapcache)
>>>>>> + zswap_entry_free(entry);
>>>>>> }
>>>>>>
>>>>>> + count_vm_events(ZSWPIN, nr_pages);
>>>>>> + if (swapcache)
>>>>>> + folio_mark_dirty(folio);
>>>>>> +
>>>>>> folio_mark_uptodate(folio);
>>>>>> return true;
>>>>>> }
>>>>>> --
>>>>>> 2.43.5
>>>>>>
>>>>>
>>>
>>> Thanks
>>> barry
>>
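
To make the plan concrete, the check I have in mind for zswap_load() in the
next revision is roughly the below (untested sketch; it assumes, as discussed
above, that a large folio can only reach zswap_load() via the synchronous
swapin path guarded by can_swapin_thp()):

        if (!zswap_present_test(folio->swap, nr_pages)) {
                /*
                 * can_swapin_thp() only lets a large folio be allocated if
                 * every entry was in zswap, so a miss here for nr_pages > 1
                 * indicates a bug. Return true without marking the folio
                 * uptodate so the fault errors out instead of reading stale
                 * data from the disk backend.
                 */
                if (WARN_ON_ONCE(nr_pages > 1))
                        return true;
                return false;
        }

If swapin_readahead() ever starts allocating large folios, this WARN will
need to be revisited, as noted above.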