On Sat, Mar 23, 2024 at 7:48 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Fri, Mar 22, 2024 at 10:33:13PM +0000, Yosry Ahmed wrote:
> > On Sat, Mar 23, 2024 at 10:41:32AM +1300, Barry Song wrote:
> > > On Sat, Mar 23, 2024 at 8:38 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > >
> > > > On Fri, Mar 22, 2024 at 9:40 AM <chengming.zhou@xxxxxxxxx> wrote:
> > > > >
> > > > > From: Chengming Zhou <chengming.zhou@xxxxxxxxx>
> > > > >
> > > > > There is a report of data corruption caused by double swapin, which is
> > > > > only possible in the skip swapcache path on SWP_SYNCHRONOUS_IO backends.
> > > > >
> > > > > The root cause is that zswap is not like other "normal" swap backends,
> > > > > it won't keep the copy of data after the first time of swapin. So if
> > >
> > > I don't quite understand this, so once we load a page from zswap, zswap
> > > will free it even though do_swap_page might not set it to PTE?
> > >
> > > shouldn't zswap free the memory after notify_free just like zram?
> >
> > It's an optimization that zswap has, exclusive loads. After a page is
> > swapped in it can stick around in the swapcache for a while. In this
> > case, there would be two copies in memory with zram (compressed and
> > uncompressed). Zswap implements exclusive loads to drop the compressed
> > copy. The folio is marked as dirty so that any attempts to reclaim it
> > cause a new write (compression) to zswap. It is also for a lot of
> > cleanups and straightforward entry lifetime tracking in zswap.
> >
> > It is mostly fine, the problem here happens because we skip the
> > swapcache during swapin, so there is a possibility that we load the
> > folio from zswap then just drop it without stashing it anywhere.
> >
> > > > >
> > > > > the folio in the first time of swapin can't be installed in the pagetable
> > > > > successfully and we just free it directly.
> > > > > Then in the second time of
> > > > > swapin, we can't find anything in zswap and read wrong data from swapfile,
> > > > > so this data corruption problem happened.
> > > > >
> > > > > We can fix it by always adding the folio into swapcache if we know the
> > > > > pinned swap entry can be found in zswap, so it won't get freed even though
> > > > > it can't be installed successfully in the first time of swapin.
> > > >
> > > > A concurrent faulting thread could have already checked the swapcache
> > > > before we add the folio to it, right? In this case, that thread will
> > > > go ahead and call swap_read_folio() anyway.
> > > >
> > > > Also, I suspect the zswap lookup might hurt performance. Would it be
> > > > better to add the folio back to zswap upon failure? This should be
> > > > detectable by checking if the folio is dirty as I mentioned in the bug
> > > > report thread.
> > >
> > > I don't like the idea either as sync-io is the fast path for zram etc.
> > > or, can we use
> > > the way of zram to free compressed data?
> >
> > I don't think we want to stop doing exclusive loads in zswap due to this
> > interaction with zram, which shouldn't be common.
> >
> > I think we can solve this by just writing the folio back to zswap upon
> > failure as I mentioned.
>
> Instead of storing again, can we avoid invalidating the entry in the
> first place if the load is not "exclusive"?
>
> The reason for exclusive loads is that the ownership is transferred to
> the swapcache, so there is no point in keeping our copy. With an
> optimistic read that doesn't transfer ownership, this doesn't
> apply. And we can easily tell inside zswap_load() if we're dealing
> with a swapcache read or not by testing the folio.
>
> The synchronous read already has to pin the swp_entry_t to be safe,
> using swapcache_prepare(). That blocks __read_swap_cache_async() which
> means no other (exclusive) loads and no invalidates can occur.
>
> The zswap entry is freed during the regular swap_free() path, which
> the sync fault calls on success. Otherwise we keep it.
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 535c907345e0..686364a6dd86 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1622,6 +1622,7 @@ bool zswap_load(struct folio *folio)
>  	swp_entry_t swp = folio->swap;
>  	pgoff_t offset = swp_offset(swp);
>  	struct page *page = &folio->page;
> +	bool swapcache = folio_test_swapcache(folio);
>  	struct zswap_tree *tree = swap_zswap_tree(swp);
>  	struct zswap_entry *entry;
>  	u8 *dst;
> @@ -1634,7 +1635,8 @@ bool zswap_load(struct folio *folio)
>  		spin_unlock(&tree->lock);
>  		return false;
>  	}
> -	zswap_rb_erase(&tree->rbroot, entry);
> +	if (swapcache)
> +		zswap_rb_erase(&tree->rbroot, entry);
>  	spin_unlock(&tree->lock);
>
>  	if (entry->length)
> @@ -1649,9 +1651,10 @@ bool zswap_load(struct folio *folio)
>  	if (entry->objcg)
>  		count_objcg_event(entry->objcg, ZSWPIN);
>
> -	zswap_entry_free(entry);
> -
> -	folio_mark_dirty(folio);
> +	if (swapcache) {
> +		zswap_entry_free(entry);
> +		folio_mark_dirty(folio);
> +	}
>
>  	return true;
>  }

Hi Johannes,

After a few hours I haven't seen any problems. If you don't mind, please add the following tags:

Tested-by: Zhongkun He <hezhongkun.hzk@xxxxxxxxxxxxx>
Reported-by: Zhongkun He <hezhongkun.hzk@xxxxxxxxxxxxx>

Thanks.