On Tue, Oct 22, 2024 at 1:21 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
>
>
> On 21/10/2024 11:55, Barry Song wrote:
> > On Mon, Oct 21, 2024 at 11:44 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 21/10/2024 06:49, Barry Song wrote:
> >>> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>>>
> >>>> At time of folio allocation, alloc_swap_folio checks if the entire
> >>>> folio is in zswap to determine folio order.
> >>>> During swap_read_folio, zswap_load will check if the entire folio
> >>>> is in zswap, and if it is, it will iterate through the pages in the
> >>>> folio and decompress them.
> >>>> This will mean the benefits of large folios (fewer page faults, batched
> >>>> PTE and rmap manipulation, reduced lru list, TLB coalescing (for arm64
> >>>> and amd)) are not lost at swap out when using zswap.
> >>>> This patch does not add support for hybrid backends (i.e. folios
> >>>> partly present in swap and zswap).
> >>>>
> >>>> Signed-off-by: Usama Arif <usamaarif642@xxxxxxxxx>
> >>>> ---
> >>>>  mm/memory.c | 13 +++-------
> >>>>  mm/zswap.c  | 68 ++++++++++++++++++++++++-----------------------
> >>>>  2 files changed, 34 insertions(+), 47 deletions(-)
> >>>>
> >>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>> index 49d243131169..75f7b9f5fb32 100644
> >>>> --- a/mm/memory.c
> >>>> +++ b/mm/memory.c
> >>>> @@ -4077,13 +4077,14 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
> >>>>
> >>>>         /*
> >>>>          * swap_read_folio() can't handle the case a large folio is hybridly
> >>>> -        * from different backends. And they are likely corner cases. Similar
> >>>> -        * things might be added once zswap support large folios.
> >>>> +        * from different backends. And they are likely corner cases.
> >>>>          */
> >>>>         if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
> >>>>                 return false;
> >>>>         if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
> >>>>                 return false;
> >>>> +       if (unlikely(!zswap_present_test(entry, nr_pages)))
> >>>> +               return false;
> >>>>
> >>>>         return true;
> >>>>  }
> >>>> @@ -4130,14 +4131,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >>>>         if (unlikely(userfaultfd_armed(vma)))
> >>>>                 goto fallback;
> >>>>
> >>>> -       /*
> >>>> -        * A large swapped out folio could be partially or fully in zswap. We
> >>>> -        * lack handling for such cases, so fallback to swapping in order-0
> >>>> -        * folio.
> >>>> -        */
> >>>> -       if (!zswap_never_enabled())
> >>>> -               goto fallback;
> >>>> -
> >>>>         entry = pte_to_swp_entry(vmf->orig_pte);
> >>>>         /*
> >>>>          * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >>>> diff --git a/mm/zswap.c b/mm/zswap.c
> >>>> index 9cc91ae31116..a5aa86c24060 100644
> >>>> --- a/mm/zswap.c
> >>>> +++ b/mm/zswap.c
> >>>> @@ -1624,59 +1624,53 @@ bool zswap_present_test(swp_entry_t swp, int nr_pages)
> >>>>
> >>>>  bool zswap_load(struct folio *folio)
> >>>>  {
> >>>> +       int nr_pages = folio_nr_pages(folio);
> >>>>         swp_entry_t swp = folio->swap;
> >>>> +       unsigned int type = swp_type(swp);
> >>>>         pgoff_t offset = swp_offset(swp);
> >>>>         bool swapcache = folio_test_swapcache(folio);
> >>>> -       struct xarray *tree = swap_zswap_tree(swp);
> >>>> +       struct xarray *tree;
> >>>>         struct zswap_entry *entry;
> >>>> +       int i;
> >>>>
> >>>>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >>>>
> >>>>         if (zswap_never_enabled())
> >>>>                 return false;
> >>>>
> >>>> -       /*
> >>>> -        * Large folios should not be swapped in while zswap is being used, as
> >>>> -        * they are not properly handled. Zswap does not properly load large
> >>>> -        * folios, and a large folio may only be partially in zswap.
> >>>> -        *
> >>>> -        * Return true without marking the folio uptodate so that an IO error is
> >>>> -        * emitted (e.g. do_swap_page() will sigbus).
> >>>> -        */
> >>>> -       if (WARN_ON_ONCE(folio_test_large(folio)))
> >>>> -               return true;
> >>>> -
> >>>> -       /*
> >>>> -        * When reading into the swapcache, invalidate our entry. The
> >>>> -        * swapcache can be the authoritative owner of the page and
> >>>> -        * its mappings, and the pressure that results from having two
> >>>> -        * in-memory copies outweighs any benefits of caching the
> >>>> -        * compression work.
> >>>> -        *
> >>>> -        * (Most swapins go through the swapcache. The notable
> >>>> -        * exception is the singleton fault on SWP_SYNCHRONOUS_IO
> >>>> -        * files, which reads into a private page and may free it if
> >>>> -        * the fault fails. We remain the primary owner of the entry.)
> >>>> -        */
> >>>> -       if (swapcache)
> >>>> -               entry = xa_erase(tree, offset);
> >>>> -       else
> >>>> -               entry = xa_load(tree, offset);
> >>>> -
> >>>> -       if (!entry)
> >>>> +       if (!zswap_present_test(folio->swap, nr_pages))
> >>>>                 return false;
> >>>
> >>> Hi Usama,
> >>>
> >>> Is there any chance that zswap_present_test() returns true
> >>> in do_swap_page() but false in zswap_load()? If that’s
> >>> possible, could we be missing something? For example,
> >>> could it be that zswap has been partially released (with
> >>> part of it still present) during an mTHP swap-in?
> >>>
> >>> If this happens with an mTHP, my understanding is that
> >>> we shouldn't proceed with reading corrupted data from the
> >>> disk backend.
> >>>
> >>
> >> If it's not swapcache, the zswap entry is not deleted so I think
> >> it should be ok?
> >>
> >> We can check over here if the entire folio is in zswap,
> >> and if not, return true without marking the folio uptodate
> >> to give an error.
> >
> > We have swapcache_prepare() called in do_swap_page(), which should
> > have protected these entries from being partially freed by other processes
> > (for example, if someone falls back to small folios for the same address).
> > Therefore, I believe that zswap_present_test() cannot be false for mTHP in
> > the current case where only synchronous I/O is supported.
> >
> > the below might help detect the bug?
> >
> > if (!zswap_present_test(folio->swap, nr_pages)) {
> >         if (WARN_ON_ONCE(nr_pages > 1))
> >                 return true;
> >         return false;
> > }
>
> I think this isn't correct. If nr_pages > 1 and the entire folio is not in zswap,
> it should still return false. So we would need to check the whole folio if we want
> to warn. But I think if we are sure the code is ok, it is an unnecessary check.

My point is that zswap_present_test() can't differentiate between:
1. the *whole* folio is not in zswap
2. the folio is *partially* not in zswap

In case 2, returning false is wrong.

And when nr_pages > 1, we have already confirmed earlier in do_swap_page()
that zswap_present_test() is true. At this point, it must always be true;
if it's false, it indicates a bug.

>
> > the code seems quite ugly :-) do we have some way to unify the code
> > for large and small folios?
> >
> > not quite sure about shmem though....
> >
>
> If it's shmem, and the swap_count goes to 1, I think it's still ok? because
> then the folio will be taken from swap_cache_get_folio() if it is already
> in the swapcache.
>
> >>
> >>
> >>>>
> >>>> -       zswap_decompress(entry, &folio->page);
> >>>> +       for (i = 0; i < nr_pages; ++i) {
> >>>> +               tree = swap_zswap_tree(swp_entry(type, offset + i));
> >>>> +               /*
> >>>> +                * When reading into the swapcache, invalidate our entry. The
> >>>> +                * swapcache can be the authoritative owner of the page and
> >>>> +                * its mappings, and the pressure that results from having two
> >>>> +                * in-memory copies outweighs any benefits of caching the
> >>>> +                * compression work.
> >>>> +                *
> >>>> +                * (Swapins with swap count > 1 go through the swapcache.
> >>>> +                * For swap count == 1, the swapcache is skipped and we
> >>>> +                * remain the primary owner of the entry.)
> >>>> +                */
> >>>> +               if (swapcache)
> >>>> +                       entry = xa_erase(tree, offset + i);
> >>>> +               else
> >>>> +                       entry = xa_load(tree, offset + i);
> >>>>
> >>>> -       count_vm_event(ZSWPIN);
> >>>> -       if (entry->objcg)
> >>>> -               count_objcg_events(entry->objcg, ZSWPIN, 1);
> >>>> +               zswap_decompress(entry, folio_page(folio, i));
> >>>>
> >>>> -       if (swapcache) {
> >>>> -               zswap_entry_free(entry);
> >>>> -               folio_mark_dirty(folio);
> >>>> +               if (entry->objcg)
> >>>> +                       count_objcg_events(entry->objcg, ZSWPIN, 1);
> >>>> +               if (swapcache)
> >>>> +                       zswap_entry_free(entry);
> >>>>         }
> >>>>
> >>>> +       count_vm_events(ZSWPIN, nr_pages);
> >>>> +       if (swapcache)
> >>>> +               folio_mark_dirty(folio);
> >>>> +
> >>>>         folio_mark_uptodate(folio);
> >>>>         return true;
> >>>>  }
> >>>> --
> >>>> 2.43.5
> >>>>
> >>>
> >
> > Thanks
> > barry
>
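
The discussion above hinges on the semantics assumed for zswap_present_test(): it
returns true only when every one of the nr_pages swap entries backing the folio has
a zswap entry, so a single false return cannot distinguish a folio that is entirely
on the disk backend from one that is only partially in zswap. A minimal sketch of
those assumed semantics (an illustration only; the actual helper added earlier in
the series may batch or order the xarray lookups differently) could look roughly
like this, reusing the swap_zswap_tree() and xa_load() calls already visible in the
patch:

/* Illustration only: true means all nr_pages entries are in zswap. */
static bool zswap_present_test(swp_entry_t swp, int nr_pages)
{
        unsigned int type = swp_type(swp);
        pgoff_t offset = swp_offset(swp);
        int i;

        for (i = 0; i < nr_pages; i++) {
                struct xarray *tree = swap_zswap_tree(swp_entry(type, offset + i));

                /*
                 * A missing entry ends the scan, so the caller cannot tell
                 * "none of the folio is in zswap" apart from "only part of
                 * it is in zswap" from the false return.
                 */
                if (!xa_load(tree, offset + i))
                        return false;
        }
        return true;
}

Under those semantics, a false return for nr_pages > 1 inside zswap_load() can only
indicate a bug, since do_swap_page() has already confirmed full presence; returning
true without marking the folio uptodate then surfaces an IO error to the fault path
rather than letting partially stale data be read from the disk backend.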