On Wed, Sep 27, 2023 at 09:48:10PM +0200, Domenico Cerasuolo wrote: > > > @@ -485,6 +487,17 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > > > __folio_set_locked(folio); > > > __folio_set_swapbacked(folio); > > > > > > + /* > > > + * Page fault might itself trigger reclaim, on a zswap object that > > > + * corresponds to the same swap entry. However, as the swap entry has > > > + * previously been pinned, the task will run into an infinite loop trying > > > + * to pin the swap entry again. > > > + * > > > + * To prevent this from happening, we remove it from the zswap > > > + * LRU to prevent its reclamation. > > > + */ > > > + zswap_lru_removed = zswap_remove_swpentry_from_lru(entry); > > > + > > > > This will add a zswap lookup (and potentially an insertion below) in > > every single swap fault path, right?. Doesn't this introduce latency > > regressions? I am also not a fan of having zswap-specific details in > > this path. > > > > When you say "pinned", do you mean the call to swapcache_prepare() > > above (i.e. setting SWAP_HAS_CACHE)? IIUC, the scenario you are > > worried about is that the following call to charge the page may invoke > > reclaim, go into zswap, and try to writeback the same page we are > > swapping in here. The writeback call will recurse into > > __read_swap_cache_async(), call swapcache_prepare() and get EEXIST, > > and keep looping indefinitely. Is this correct? Yeah, exactly. > > If yes, can we handle this by adding a flag to > > __read_swap_cache_async() that basically says "don't wait for > > SWAP_HAS_CACHE and the swapcache to be consistent, if > > swapcache_prepare() returns EEXIST just fail and return"? The zswap > > writeback path can pass in this flag and skip such pages. We might > > want to modify the writeback code to put back those pages at the end > > of the lru instead of in the beginning. > > Thanks for the suggestion, this actually works and it seems cleaner so I think > we'll go for your solution. That sounds like a great idea. It should be pointed out that these aren't perfectly equivalent. Removing the entry from the LRU eliminates the lock recursion scenario on that very specific entry. Having writeback skip on -EEXIST will make it skip *any* pages that are concurrently entering the swapcache, even when it *could* wait for them to finish. However, pages that are concurrently read back into memory are a poor choice for writeback anyway, and likely to be removed from swap soon. So it happens to work out just fine in this case. I'd just add a comment that explains the recursion deadlock, as well as the implication of skipping any busy entry and why that's okay.