On Mon, Oct 21, 2024 at 1:57 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
>
>
> On 21/10/2024 21:28, Barry Song wrote:
> > On Tue, Oct 22, 2024 at 1:21 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 21/10/2024 11:55, Barry Song wrote:
> >>> On Mon, Oct 21, 2024 at 11:44 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 21/10/2024 06:49, Barry Song wrote:
> >>>>> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
> >>>>>>
> >>>>>> At time of folio allocation, alloc_swap_folio checks if the entire
> >>>>>> folio is in zswap to determine folio order.
> >>>>>> During swap_read_folio, zswap_load will check if the entire folio
> >>>>>> is in zswap, and if it is, it will iterate through the pages in
> >>>>>> folio and decompress them.
> >>>>>> This will mean the benefits of large folios (fewer page faults, batched
> >>>>>> PTE and rmap manipulation, reduced lru list, TLB coalescing (for arm64
> >>>>>> and amd) are not lost at swap out when using zswap.
> >>>>>> This patch does not add support for hybrid backends (i.e. folios
> >>>>>> partly present swap and zswap).
> >>>>>>
> >>>>>> Signed-off-by: Usama Arif <usamaarif642@xxxxxxxxx>
> >>>>>> ---
> >>>>>>  mm/memory.c | 13 +++-------
> >>>>>>  mm/zswap.c  | 68 ++++++++++++++++++++++++-----------------------
> >>>>>>  2 files changed, 34 insertions(+), 47 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>>>> index 49d243131169..75f7b9f5fb32 100644
> >>>>>> --- a/mm/memory.c
> >>>>>> +++ b/mm/memory.c
> >>>>>> @@ -4077,13 +4077,14 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
> >>>>>>
> >>>>>>         /*
> >>>>>>          * swap_read_folio() can't handle the case a large folio is hybridly
> >>>>>> -        * from different backends. And they are likely corner cases. Similar
> >>>>>> -        * things might be added once zswap support large folios.
> >>>>>> +        * from different backends. And they are likely corner cases.
> >>>>>>          */
> >>>>>>         if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
> >>>>>>                 return false;
> >>>>>>         if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
> >>>>>>                 return false;
> >>>>>> +       if (unlikely(!zswap_present_test(entry, nr_pages)))
> >>>>>> +               return false;

Hmm if the entire folio is not in zswap, this will prevent the large folio
swapin, right?

Also, I think this is racy, see the comments below and in patch 1.

> >>>>>>
> >>>>>>         return true;
> >>>>>>  }
> >>>>>> @@ -4130,14 +4131,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >>>>>>         if (unlikely(userfaultfd_armed(vma)))
> >>>>>>                 goto fallback;
> >>>>>>
> >>>>>> -       /*
> >>>>>> -        * A large swapped out folio could be partially or fully in zswap. We
> >>>>>> -        * lack handling for such cases, so fallback to swapping in order-0
> >>>>>> -        * folio.
> >>>>>> -        */
> >>>>>> -       if (!zswap_never_enabled())
> >>>>>> -               goto fallback;
> >>>>>> -
> >>>>>>         entry = pte_to_swp_entry(vmf->orig_pte);
> >>>>>>         /*
> >>>>>>          * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >>>>>> diff --git a/mm/zswap.c b/mm/zswap.c
> >>>>>> index 9cc91ae31116..a5aa86c24060 100644
> >>>>>> --- a/mm/zswap.c
> >>>>>> +++ b/mm/zswap.c
> >>>>>> @@ -1624,59 +1624,53 @@ bool zswap_present_test(swp_entry_t swp, int nr_pages)
> >>>>>>
> >>>>>>  bool zswap_load(struct folio *folio)
> >>>>>>  {
> >>>>>> +       int nr_pages = folio_nr_pages(folio);
> >>>>>>         swp_entry_t swp = folio->swap;
> >>>>>> +       unsigned int type = swp_type(swp);
> >>>>>>         pgoff_t offset = swp_offset(swp);
> >>>>>>         bool swapcache = folio_test_swapcache(folio);
> >>>>>> -       struct xarray *tree = swap_zswap_tree(swp);
> >>>>>> +       struct xarray *tree;
> >>>>>>         struct zswap_entry *entry;
> >>>>>> +       int i;
> >>>>>>
> >>>>>>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >>>>>>
> >>>>>>         if (zswap_never_enabled())
> >>>>>>                 return false;
> >>>>>>
> >>>>>> -       /*
> >>>>>> -        * Large folios should not be swapped in while zswap is being used, as
> >>>>>> -        * they are not properly handled. Zswap does not properly load large
> >>>>>> -        * folios, and a large folio may only be partially in zswap.
> >>>>>> -        *
> >>>>>> -        * Return true without marking the folio uptodate so that an IO error is
> >>>>>> -        * emitted (e.g. do_swap_page() will sigbus).
> >>>>>> -        */
> >>>>>> -       if (WARN_ON_ONCE(folio_test_large(folio)))
> >>>>>> -               return true;
> >>>>>> -
> >>>>>> -       /*
> >>>>>> -        * When reading into the swapcache, invalidate our entry. The
> >>>>>> -        * swapcache can be the authoritative owner of the page and
> >>>>>> -        * its mappings, and the pressure that results from having two
> >>>>>> -        * in-memory copies outweighs any benefits of caching the
> >>>>>> -        * compression work.
> >>>>>> -        *
> >>>>>> -        * (Most swapins go through the swapcache. The notable
> >>>>>> -        * exception is the singleton fault on SWP_SYNCHRONOUS_IO
> >>>>>> -        * files, which reads into a private page and may free it if
> >>>>>> -        * the fault fails. We remain the primary owner of the entry.)
> >>>>>> -        */
> >>>>>> -       if (swapcache)
> >>>>>> -               entry = xa_erase(tree, offset);
> >>>>>> -       else
> >>>>>> -               entry = xa_load(tree, offset);
> >>>>>> -
> >>>>>> -       if (!entry)
> >>>>>> +       if (!zswap_present_test(folio->swap, nr_pages))
> >>>>>>                 return false;
> >>>>>
> >>>>> Hi Usama,
> >>>>>
> >>>>> Is there any chance that zswap_present_test() returns true
> >>>>> in do_swap_page() but false in zswap_load()? If that’s
> >>>>> possible, could we be missing something? For example,
> >>>>> could it be that zswap has been partially released (with
> >>>>> part of it still present) during an mTHP swap-in?

As I mentioned in patch 1, we need to document when the result of
zswap_present_test() is stable, and we can't race with other stores,
exclusive loads, writeback, or invalidation.

> >>>>>
> >>>>> If this happens with an mTHP, my understanding is that
> >>>>> we shouldn't proceed with reading corrupted data from the
> >>>>> disk backend.
> >>>>>
> >>>>
> >>>> If its not swapcache, the zswap entry is not deleted so I think
> >>>> it should be ok?

Can we race with things like writeback and other exclusive loads because
swapcache_prepare() is not called yet?

> >>>>
> >>>> We can check over here if the entire folio is in zswap,
> >>>> and if not, return true without marking the folio uptodate
> >>>> to give an error.
> >>>
> >>> We have swapcache_prepare() called in do_swap_page(), which should
> >>> have protected these entries from being partially freed by other processes
> >>> (for example, if someone falls back to small folios for the same address).
> >>> Therefore, I believe that zswap_present_test() cannot be false for mTHP in
> >>> the current case where only synchronous I/O is supported.
> >>>
> >>> the below might help detect the bug?
> >>>
> >>> if (!zswap_present_test(folio->swap, nr_pages)) {
> >>>         if (WARN_ON_ONCE(nr_pages > 1))
> >>>                 return true;
> >>>         return false;
> >>> }
> >>>
> >>
> >> I think this isn't correct. If nr_pages > 1 and the entire folio is not in zswap,
> >> it should still return false. So would need to check the whole folio if we want to
> >> warn. But I think if we are sure the code is ok, it is an unnecessary check.
> >
> > my point is that zswap_present_test() can't differentiate
> > 1. the *whole* folio is not in zswap
> > 2. the folio is *partially* not in zswap
> >
> > in case 2, returning false is wrong.
> >
>
> Agreed!
>
> > And when nr_pages > 1, we have already confirmed earlier in
> > do_swap_page() that zswap_present_test() is true. At this point,
> > it must always be true; if it's false, it indicates a bug.
> >
>
> Yes agreed! I was thinking from just zswap_load perspective irrespective
> of who calls it.
> If someone adds large folio support to swapin_readahead, then I think the
> above warn might be an issue.
>
> But just with this patch series, doing what you suggested is correct. I
> will add it in next revision. We can deal with it once swap count > 1,
> starts supporting large folios.

I think I don't follow this part of the conversation properly, but it seems
like we want to catch the case where we end up in zswap_load() and only part
of the folio is in zswap. Can we use something like the approach we used for
swap_zeromap_batch()?
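
Something like the below, perhaps. Completely untested, and zswap_present_batch()
is just a name I made up for illustration: count how many of the leading swap
entries actually have a zswap entry, so zswap_load() can tell apart "none in
zswap" (fall back to the swapfile), "all in zswap" (proceed), and the buggy
partial case (warn and return without marking the folio uptodate):

/*
 * Untested sketch, loosely modeled on swap_zeromap_batch(): return how
 * many of the @max_nr swap entries starting at @entry currently have a
 * zswap entry, stopping at the first one that does not. Only meaningful
 * while the caller keeps the entries stable (e.g. after
 * swapcache_prepare()), otherwise the answer can change under us.
 */
static int zswap_present_batch(swp_entry_t entry, int max_nr)
{
	unsigned int type = swp_type(entry);
	pgoff_t offset = swp_offset(entry);
	int i;

	for (i = 0; i < max_nr; i++) {
		swp_entry_t e = swp_entry(type, offset + i);

		/* consecutive offsets may live in different zswap trees */
		if (!xa_load(swap_zswap_tree(e), swp_offset(e)))
			break;
	}

	return i;
}

and then in zswap_load(), something like:

	int present = zswap_present_batch(folio->swap, nr_pages);

	if (!present)
		return false;	/* nothing in zswap, read from the backend */

	/* partial folio in zswap: don't mark uptodate so the fault errors out */
	if (WARN_ON_ONCE(present != nr_pages))
		return true;

That way can_swapin_thp() could keep its all-or-nothing check, and zswap_load()
would catch the partial case instead of silently reading stale data from disk.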