Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Tue, 11 Jun 2024 08:42:51 -0700

On Tue, Jun 11, 2024 at 4:49 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
> @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool
> synchronous,
> >>>>                   psi_memstall_enter(&pflags);
> >>>>           }
> >>>>           delayacct_swapin_start();
> >>>> -
> >>>> -       if (zswap_load(folio)) {
> >>>> +       if (swap_zeromap_folio_test(folio)) {
> >>>> +               folio_zero_fill(folio);
> >>>> +               folio_mark_uptodate(folio);
> >>>> +               folio_unlock(folio);
> >>> We don't currently support swapping in large folios, but it is a work
> >>> in progress, and this will break once we have it.
> >>> swap_zeromap_folio_test() will return false even if parts of the folio
> >>> are in fact zero-filled. Then, we will go read those from disk swap,
> >>> essentially corrupting data.
> >> So yes, with this patch I tested swap out of large zero folio, but when
> >> swapping in it was page by page. My assumption was that swap_read_folio
> >> (when support is added) would only pass a large folio that was earlier
> >> swapped out as a large folio. So if a zero filled large folio was
> >> swapped out, the zeromap will be set for all the pages in that folio,
> >> and at large folio swap in (when it is supported), it will see that all
> >> the bits in the zeromap for that folio are set,  and will just
> >> folio_zero_fill.
> >>
> >> If even a single page in large folio has non-zero data then zeromap will
> >> not store it and it will go to either zswap or disk, and at read time as
> >> all the bits in zeromap are not set, it will still goto either zswap or
> >> disk, so I think this works?
> >>
> >> Is my assumption wrong that only large folios can be swapped in only if
> >> they were swapped out as large? I think this code works in that case.
> > I think the assumption is incorrect. I think we would just check if
> > contiguous PTEs have contiguous swap entries and swapin the folio as a
> > large folio in this case. It is likely that the swap entries are
> > contiguous because it was swapped out as a large folio, but I don't
> > think it's guaranteed.
>
> Yes, makes sense. Thanks for explaining this.
>
> >
> > For example, here is a patch that implements large swapin support for
> > the synchronous swapin case, and I think it just checks that the PTEs
> > have contiguous swap entries:
> > https://lore.kernel.org/linux-mm/20240304081348.197341-6-21cnbao@xxxxxxxxx/
> >
> > This makes a lot of sense because otherwise you'd have to keep track
> > of how the folios were composed at the time of swapout, to be able to
> > swap the same folios back in.
>
> I think the solution to large folio swap-in for this optimization and
> zswap is not in swap_read_folio in this patch-series or any call further
> down the stack, as its too late by the time you reach here, but in
> Barrys' patch. This needs to happen much earlier when deciding the size
> of the folio at folio creation in alloc_swap_folio, after Barry checks
>
>      if (is_zswap_enabled())
>          goto fallback;
>
> once the order is decided, we would need to check the indexes in the
> zeromap array starting from swap_offset(entry) and ending at
> swap_offset(entry) + 2^order are set. If no bits are set, or all bits
> are set, then we allocate large folio. Otherwise, we goto fallback.
>
> I think its better to handle this in Barrys patch. I feel this series is
> close to its final state, i.e. the only diff I have for the next
> revision is below to remove start/end_writeback for zer_filled case. I
> will comment on Barrys patch once the I send out the next revision of this.

Sorry I did not make myself clearer. I did not mean that you should
handle the large folio swapin here. This needs to be handled at a
higher level because as you mentioned, a large folio may be partially
in the zeromap, zswap, swapcache, disk, etc.

What I meant is that we should probably have a debug check to make
sure this doesn't go unhandled. For zswap, I am trying to add a
warning and fail the swapin operation if a large folio slips through
to zswap. We can do something similar here if folks agree this is the
right way in the interim:
https://lore.kernel.org/lkml/20240611024516.1375191-3-yosryahmed@xxxxxxxxxx/.

Maybe I am too paranoid, but I think it's easy to mess up these things
when working on large folio swapin imo.