On Fri, Mar 22, 2024 at 05:14:37PM -0700, Yosry Ahmed wrote:
> [..]
> > > > I don't think we want to stop doing exclusive loads in zswap due to this
> > > > interaction with zram, which shouldn't be common.
> > > >
> > > > I think we can solve this by just writing the folio back to zswap upon
> > > > failure as I mentioned.
> > >
> > > Instead of storing again, can we avoid invalidating the entry in the
> > > first place if the load is not "exclusive"?
> > >
> > > The reason for exclusive loads is that the ownership is transferred to
> > > the swapcache, so there is no point in keeping our copy. With an
> > > optimistic read that doesn't transfer ownership, this doesn't
> > > apply. And we can easily tell inside zswap_load() if we're dealing
> > > with a swapcache read or not by testing the folio.
> > >
> > > The synchronous read already has to pin the swp_entry_t to be safe,
> > > using swapcache_prepare(). That blocks __read_swap_cache_async() which
> > > means no other (exclusive) loads and no invalidates can occur.
> > >
> > > The zswap entry is freed during the regular swap_free() path, which
> > > the sync fault calls on success. Otherwise we keep it.
> >
> > I thought about this, but I was particularly worried about the need to
> > bring back the refcount that was removed when we switched to only
> > supporting exclusive loads:
> > https://lore.kernel.org/lkml/20240201-b4-zswap-invalidate-entry-v2-6-99d4084260a0@xxxxxxxxxxxxx/
> >
> > It seems to me that we don't need it, because swap_free() will free
> > the entry as you mentioned before anyone else has the chance to load
> > it or invalidate it. Writeback used to grab a reference as well, but
> > it removes the entry from the tree anyway and takes full ownership of
> > it then frees it, so that should be okay.
> >
> > It makes me nervous though to be honest. For example, not long ago
> > swap_free() didn't call zswap_invalidate() directly (used to happen
> > during swap slots cache draining). Without it, a subsequent load could
> > race with writeback without refcount protection, right? We would need
> > to make sure to backport 0827a1fb143f ("mm/zswap: invalidate zswap
> > entry when swap entry free") with the fix to stable for instance.
> >
> > I can't find a problem with your diff, but it just makes me nervous to
> > have non-exclusive loads without a refcount.
> > >
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index 535c907345e0..686364a6dd86 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -1622,6 +1622,7 @@ bool zswap_load(struct folio *folio)
> > >  	swp_entry_t swp = folio->swap;
> > >  	pgoff_t offset = swp_offset(swp);
> > >  	struct page *page = &folio->page;
> > > +	bool swapcache = folio_test_swapcache(folio);
> > >  	struct zswap_tree *tree = swap_zswap_tree(swp);
> > >  	struct zswap_entry *entry;
> > >  	u8 *dst;
> > > @@ -1634,7 +1635,8 @@ bool zswap_load(struct folio *folio)
> > >  		spin_unlock(&tree->lock);
> > >  		return false;
> > >  	}
> > > -	zswap_rb_erase(&tree->rbroot, entry);
> > > +	if (swapcache)
> > > +		zswap_rb_erase(&tree->rbroot, entry);
> On second thought, if we don't remove the entry from the tree here,
> writeback could free the entry from under us after we drop the lock
> here, right?

The sync-swapin does swapcache_prepare() and holds SWAP_HAS_CACHE, so
racing writeback would loop on the -EEXIST in __read_swap_cache_async().
(Or, if writeback wins the race, sync-swapin fails on swapcache_prepare()
instead and bails on the fault.)

This isn't coincidental.
The sync-swapin needs to, and does, serialize against the swap entry
moving into swapcache or being invalidated for it to be safe. Which is
the same requirement that zswap ops have.
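
Roughly, the pin the skip-swapcache fault already holds would order
things like this (hand-wavy sketch, not the literal do_swap_page() code;
si here is the entry's swap_info_struct):

	if (swapcache_prepare(entry))	/* pin: sets SWAP_HAS_CACHE */
		goto out;		/* lost the race, back off and refault */

	/*
	 * Read the data. With the change above, zswap_load() sees
	 * !folio_test_swapcache() and leaves its entry in the tree.
	 * A racing zswap writeback goes through __read_swap_cache_async(),
	 * whose swapcache_prepare() returns -EEXIST because of our pin,
	 * so it can't take the entry away from under us either.
	 */

	/* on success: */
	swap_free(entry);		/* zswap_invalidate() drops the zswap copy */
	swapcache_clear(si, entry);	/* release the SWAP_HAS_CACHE pin */

IOW the pin covers the whole window from before the zswap lookup until
swap_free().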