On Mon, Oct 21, 2024 at 1:57 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
>
>
> On 21/10/2024 21:28, Barry Song wrote:
> > On Tue, Oct 22, 2024 at 1:21 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 21/10/2024 11:55, Barry Song wrote:
> >>> On Mon, Oct 21, 2024 at 11:44 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 21/10/2024 06:49, Barry Song wrote:
> >>>>> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>>>>>
> >>>>>> At time of folio allocation, alloc_swap_folio checks if the entire
> >>>>>> folio is in zswap to determine folio order.
> >>>>>> During swap_read_folio, zswap_load will check if the entire folio
> >>>>>> is in zswap, and if it is, it will iterate through the pages in
> >>>>>> folio and decompress them.
> >>>>>> This will mean the benefits of large folios (fewer page faults, batched
> >>>>>> PTE and rmap manipulation, reduced lru list, TLB coalescing (for arm64
> >>>>>> and amd) are not lost at swap out when using zswap.
> >>>>>> This patch does not add support for hybrid backends (i.e. folios
> >>>>>> partly present swap and zswap).
> >>>>>>
> >>>>>> Signed-off-by: Usama Arif <usamaarif642@xxxxxxxxx>
> >>>>>> ---
> >>>>>>  mm/memory.c | 13 +++-------
> >>>>>>  mm/zswap.c  | 68 ++++++++++++++++++++++++-----------------------
> >>>>>>  2 files changed, 34 insertions(+), 47 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>>>> index 49d243131169..75f7b9f5fb32 100644
> >>>>>> --- a/mm/memory.c
> >>>>>> +++ b/mm/memory.c
> >>>>>> @@ -4077,13 +4077,14 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
> >>>>>>
> >>>>>>         /*
> >>>>>>          * swap_read_folio() can't handle the case a large folio is hybridly
> >>>>>> -        * from different backends. And they are likely corner cases. Similar
> >>>>>> -        * things might be added once zswap support large folios.
> >>>>>> +        * from different backends. And they are likely corner cases.
> >>>>>>          */
> >>>>>>         if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
> >>>>>>                 return false;
> >>>>>>         if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
> >>>>>>                 return false;
> >>>>>> +       if (unlikely(!zswap_present_test(entry, nr_pages)))
> >>>>>> +               return false;

Hmm if the entire folio is not in zswap, this will prevent the large folio
swapin, right?

Also, I think this is racy, see the comments below and in patch 1.

> >>>>>>
> >>>>>>         return true;
> >>>>>>  }
> >>>>>> @@ -4130,14 +4131,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >>>>>>         if (unlikely(userfaultfd_armed(vma)))
> >>>>>>                 goto fallback;
> >>>>>>
> >>>>>> -       /*
> >>>>>> -        * A large swapped out folio could be partially or fully in zswap. We
> >>>>>> -        * lack handling for such cases, so fallback to swapping in order-0
> >>>>>> -        * folio.
> >>>>>> -        */
> >>>>>> -       if (!zswap_never_enabled())
> >>>>>> -               goto fallback;
> >>>>>> -
> >>>>>>         entry = pte_to_swp_entry(vmf->orig_pte);
> >>>>>>         /*
> >>>>>>          * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >>>>>> diff --git a/mm/zswap.c b/mm/zswap.c
> >>>>>> index 9cc91ae31116..a5aa86c24060 100644
> >>>>>> --- a/mm/zswap.c
> >>>>>> +++ b/mm/zswap.c
> >>>>>> @@ -1624,59 +1624,53 @@ bool zswap_present_test(swp_entry_t swp, int nr_pages)
> >>>>>>
> >>>>>>  bool zswap_load(struct folio *folio)
> >>>>>>  {
> >>>>>> +       int nr_pages = folio_nr_pages(folio);
> >>>>>>         swp_entry_t swp = folio->swap;
> >>>>>> +       unsigned int type = swp_type(swp);
> >>>>>>         pgoff_t offset = swp_offset(swp);
> >>>>>>         bool swapcache = folio_test_swapcache(folio);
> >>>>>> -       struct xarray *tree = swap_zswap_tree(swp);
> >>>>>> +       struct xarray *tree;
> >>>>>>         struct zswap_entry *entry;
> >>>>>> +       int i;
> >>>>>>
> >>>>>>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >>>>>>
> >>>>>>         if (zswap_never_enabled())
> >>>>>>                 return false;
> >>>>>>
> >>>>>> -       /*
> >>>>>> -        * Large folios should not be swapped in while zswap is being used, as
> >>>>>> -        * they are not properly handled. Zswap does not properly load large
> >>>>>> -        * folios, and a large folio may only be partially in zswap.
> >>>>>> -        *
> >>>>>> -        * Return true without marking the folio uptodate so that an IO error is
> >>>>>> -        * emitted (e.g. do_swap_page() will sigbus).
> >>>>>> -        */
> >>>>>> -       if (WARN_ON_ONCE(folio_test_large(folio)))
> >>>>>> -               return true;
> >>>>>> -
> >>>>>> -       /*
> >>>>>> -        * When reading into the swapcache, invalidate our entry. The
> >>>>>> -        * swapcache can be the authoritative owner of the page and
> >>>>>> -        * its mappings, and the pressure that results from having two
> >>>>>> -        * in-memory copies outweighs any benefits of caching the
> >>>>>> -        * compression work.
> >>>>>> -        *
> >>>>>> -        * (Most swapins go through the swapcache. The notable
> >>>>>> -        * exception is the singleton fault on SWP_SYNCHRONOUS_IO
> >>>>>> -        * files, which reads into a private page and may free it if
> >>>>>> -        * the fault fails. We remain the primary owner of the entry.)
> >>>>>> -        */
> >>>>>> -       if (swapcache)
> >>>>>> -               entry = xa_erase(tree, offset);
> >>>>>> -       else
> >>>>>> -               entry = xa_load(tree, offset);
> >>>>>> -
> >>>>>> -       if (!entry)
> >>>>>> +       if (!zswap_present_test(folio->swap, nr_pages))
> >>>>>>                 return false;
> >>>>>
> >>>>> Hi Usama,
> >>>>>
> >>>>> Is there any chance that zswap_present_test() returns true
> >>>>> in do_swap_page() but false in zswap_load()? If that’s
> >>>>> possible, could we be missing something? For example,
> >>>>> could it be that zswap has been partially released (with
> >>>>> part of it still present) during an mTHP swap-in?

As I mentioned in patch 1, we need to document when the result of
zswap_present_test() is stable, and we can't race with other stores,
exclusive loads, writeback, or invalidation.

> >>>>>
> >>>>> If this happens with an mTHP, my understanding is that
> >>>>> we shouldn't proceed with reading corrupted data from the
> >>>>> disk backend.
> >>>>>
> >>>>
> >>>> If its not swapcache, the zswap entry is not deleted so I think
> >>>> it should be ok?

Can we race with things like writeback and other exclusive loads because
swapcache_prepare() is not called yet?

> >>>>
> >>>> We can check over here if the entire folio is in zswap,
> >>>> and if not, return true without marking the folio uptodate
> >>>> to give an error.
> >>>
> >>> We have swapcache_prepare() called in do_swap_page(), which should
> >>> have protected these entries from being partially freed by other processes
> >>> (for example, if someone falls back to small folios for the same address).
> >>> Therefore, I believe that zswap_present_test() cannot be false for mTHP in
> >>> the current case where only synchronous I/O is supported.
> >>>
> >>> the below might help detect the bug?
> >>>
> >>> if (!zswap_present_test(folio->swap, nr_pages)) {
> >>>         if (WARN_ON_ONCE(nr_pages > 1))
> >>>                 return true;
> >>>         return false;
> >>> }
> >>>
> >>
> >> I think this isn't correct. If nr_pages > 1 and the entire folio is not in zswap,
> >> it should still return false. So would need to check the whole folio if we want to
> >> warn. But I think if we are sure the code is ok, it is an unnecessary check.
> >
> > my point is that zswap_present_test() can't differentiate
> > 1. the *whole* folio is not in zswap
> > 2. the folio is *partially* not in zswap
> >
> > in case 2, returning false is wrong.
> >
>
> Agreed!
>
> > And when nr_pages > 1, we have already confirmed earlier in
> > do_swap_page() that zswap_present_test() is true. At this point,
> > it must always be true; if it's false, it indicates a bug.
> >
>
> Yes agreed! I was thinking from just zswap_load perspective irrespective
> of who calls it.
> If someone adds large folio support to swapin_readahead, then I think the
> above warn might be an issue.
>
> But just with this patch series, doing what you suggested is correct. I
> will add it in next revision. We can deal with it once swap count > 1,
> starts supporting large folios.

I think I don't follow this part of the conversation properly, but it seems
like we want to catch the case where we end up in zswap_load() and only part
of the folio is in zswap. Can we use something like the approach we used for
swap_zeromap_batch()?
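
Something like the below, perhaps. Completely untested, and zswap_present_batch()
is just a name I made up for illustration: count how many of the leading swap
entries actually have a zswap entry, so zswap_load() can tell apart "none in
zswap" (fall back to the swapfile), "all in zswap" (proceed), and the buggy
partial case (warn and return without marking the folio uptodate):

/*
 * Untested sketch, loosely modeled on swap_zeromap_batch(): return how
 * many of the @max_nr swap entries starting at @entry currently have a
 * zswap entry, stopping at the first one that does not. Only meaningful
 * while the caller keeps the entries stable (e.g. after
 * swapcache_prepare()), otherwise the answer can change under us.
 */
static int zswap_present_batch(swp_entry_t entry, int max_nr)
{
	unsigned int type = swp_type(entry);
	pgoff_t offset = swp_offset(entry);
	int i;

	for (i = 0; i < max_nr; i++) {
		swp_entry_t e = swp_entry(type, offset + i);

		/* consecutive offsets may live in different zswap trees */
		if (!xa_load(swap_zswap_tree(e), swp_offset(e)))
			break;
	}

	return i;
}

and then in zswap_load(), something like:

	int present = zswap_present_batch(folio->swap, nr_pages);

	if (!present)
		return false;	/* nothing in zswap, read from the backend */

	/* partial folio in zswap: don't mark uptodate so the fault errors out */
	if (WARN_ON_ONCE(present != nr_pages))
		return true;

That way can_swapin_thp() could keep its all-or-nothing check, and zswap_load()
would catch the partial case instead of silently reading stale data from disk.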