On Sat, Mar 23, 2024 at 7:48 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Fri, Mar 22, 2024 at 10:33:13PM +0000, Yosry Ahmed wrote:
> > On Sat, Mar 23, 2024 at 10:41:32AM +1300, Barry Song wrote:
> > > On Sat, Mar 23, 2024 at 8:38 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > >
> > > > On Fri, Mar 22, 2024 at 9:40 AM <chengming.zhou@xxxxxxxxx> wrote:
> > > > >
> > > > > From: Chengming Zhou <chengming.zhou@xxxxxxxxx>
> > > > >
> > > > > There is a report of data corruption caused by double swapin, which is
> > > > > only possible in the skip swapcache path on SWP_SYNCHRONOUS_IO backends.
> > > > >
> > > > > The root cause is that zswap is not like other "normal" swap backends,
> > > > > it won't keep the copy of data after the first time of swapin. So if
> > >
> > > I don't quite understand this, so once we load a page from zswap, zswap
> > > will free it even though do_swap_page might not set it to PTE?
> > >
> > > shouldn't zswap free the memory after notify_free just like zram?
> >
> > It's an optimization that zswap has, exclusive loads. After a page is
> > swapped in it can stick around in the swapcache for a while. In this
> > case, there would be two copies in memory with zram (compressed and
> > uncompressed). Zswap implements exclusive loads to drop the compressed
> > copy. The folio is marked as dirty so that any attempts to reclaim it
> > cause a new write (compression) to zswap. It is also for a lot of
> > cleanups and straightforward entry lifetime tracking in zswap.
> >
> > It is mostly fine, the problem here happens because we skip the
> > swapcache during swapin, so there is a possibility that we load the
> > folio from zswap then just drop it without stashing it anywhere.
> >
> > > > >
> > > > > the folio in the first time of swapin can't be installed in the pagetable
> > > > > successfully and we just free it directly.
> > > > > Then in the second time of
> > > > > swapin, we can't find anything in zswap and read wrong data from swapfile,
> > > > > so this data corruption problem happened.
> > > > >
> > > > > We can fix it by always adding the folio into swapcache if we know the
> > > > > pinned swap entry can be found in zswap, so it won't get freed even though
> > > > > it can't be installed successfully in the first time of swapin.
> > > >
> > > > A concurrent faulting thread could have already checked the swapcache
> > > > before we add the folio to it, right? In this case, that thread will
> > > > go ahead and call swap_read_folio() anyway.
> > > >
> > > > Also, I suspect the zswap lookup might hurt performance. Would it be
> > > > better to add the folio back to zswap upon failure? This should be
> > > > detectable by checking if the folio is dirty as I mentioned in the bug
> > > > report thread.
> > >
> > > I don't like the idea either as sync-io is the fast path for zram etc.
> > > or, can we use
> > > the way of zram to free compressed data?
> >
> > I don't think we want to stop doing exclusive loads in zswap due to this
> > interaction with zram, which shouldn't be common.
> >
> > I think we can solve this by just writing the folio back to zswap upon
> > failure as I mentioned.
>
> Instead of storing again, can we avoid invalidating the entry in the
> first place if the load is not "exclusive"?
>
> The reason for exclusive loads is that the ownership is transferred to
> the swapcache, so there is no point in keeping our copy. With an
> optimistic read that doesn't transfer ownership, this doesn't
> apply. And we can easily tell inside zswap_load() if we're dealing
> with a swapcache read or not by testing the folio.
>
> The synchronous read already has to pin the swp_entry_t to be safe,
> using swapcache_prepare(). That blocks __read_swap_cache_async() which
> means no other (exclusive) loads and no invalidates can occur.
>
> The zswap entry is freed during the regular swap_free() path, which
> the sync fault calls on success. Otherwise we keep it.
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 535c907345e0..686364a6dd86 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1622,6 +1622,7 @@ bool zswap_load(struct folio *folio)
>  	swp_entry_t swp = folio->swap;
>  	pgoff_t offset = swp_offset(swp);
>  	struct page *page = &folio->page;
> +	bool swapcache = folio_test_swapcache(folio);
>  	struct zswap_tree *tree = swap_zswap_tree(swp);
>  	struct zswap_entry *entry;
>  	u8 *dst;
> @@ -1634,7 +1635,8 @@ bool zswap_load(struct folio *folio)
>  		spin_unlock(&tree->lock);
>  		return false;
>  	}
> -	zswap_rb_erase(&tree->rbroot, entry);
> +	if (swapcache)
> +		zswap_rb_erase(&tree->rbroot, entry);
>  	spin_unlock(&tree->lock);
>
>  	if (entry->length)
> @@ -1649,9 +1651,10 @@ bool zswap_load(struct folio *folio)
>  	if (entry->objcg)
>  		count_objcg_event(entry->objcg, ZSWPIN);
>
> -	zswap_entry_free(entry);
> -
> -	folio_mark_dirty(folio);
> +	if (swapcache) {
> +		zswap_entry_free(entry);
> +		folio_mark_dirty(folio);
> +	}
>
>  	return true;
>  }

Hi Johannes,

After a few hours I haven't seen any problems. If you don't mind, please add the following tags:

Tested-by: Zhongkun He <hezhongkun.hzk@xxxxxxxxxxxxx>
Reported-by: Zhongkun He <hezhongkun.hzk@xxxxxxxxxxxxx>

Thanks.