On Wed, Sep 27, 2023 at 2:02 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Wed, Sep 27, 2023 at 09:48:10PM +0200, Domenico Cerasuolo wrote:
> > > > @@ -485,6 +487,17 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > > >         __folio_set_locked(folio);
> > > >         __folio_set_swapbacked(folio);
> > > >
> > > > +       /*
> > > > +        * Page fault might itself trigger reclaim, on a zswap object that
> > > > +        * corresponds to the same swap entry. However, as the swap entry has
> > > > +        * previously been pinned, the task will run into an infinite loop trying
> > > > +        * to pin the swap entry again.
> > > > +        *
> > > > +        * To prevent this from happening, we remove it from the zswap
> > > > +        * LRU to prevent its reclamation.
> > > > +        */
> > > > +       zswap_lru_removed = zswap_remove_swpentry_from_lru(entry);
> > > > +
> > >
> > > This will add a zswap lookup (and potentially an insertion below) in
> > > every single swap fault path, right?. Doesn't this introduce latency
> > > regressions? I am also not a fan of having zswap-specific details in
> > > this path.
> > >
> > > When you say "pinned", do you mean the call to swapcache_prepare()
> > > above (i.e. setting SWAP_HAS_CACHE)? IIUC, the scenario you are
> > > worried about is that the following call to charge the page may invoke
> > > reclaim, go into zswap, and try to writeback the same page we are
> > > swapping in here. The writeback call will recurse into
> > > __read_swap_cache_async(), call swapcache_prepare() and get EEXIST,
> > > and keep looping indefinitely. Is this correct?
>
> Yeah, exactly.
>
> > > If yes, can we handle this by adding a flag to
> > > __read_swap_cache_async() that basically says "don't wait for
> > > SWAP_HAS_CACHE and the swapcache to be consistent, if
> > > swapcache_prepare() returns EEXIST just fail and return"? The zswap
> > > writeback path can pass in this flag and skip such pages. We might
> > > want to modify the writeback code to put back those pages at the end
> > > of the lru instead of in the beginning.
> >
> > Thanks for the suggestion, this actually works and it seems cleaner so I think
> > we'll go for your solution.
>
> That sounds like a great idea.
>
> It should be pointed out that these aren't perfectly
> equivalent. Removing the entry from the LRU eliminates the lock
> recursion scenario on that very specific entry.
>
> Having writeback skip on -EEXIST will make it skip *any* pages that
> are concurrently entering the swapcache, even when it *could* wait for
> them to finish.
>
> However, pages that are concurrently read back into memory are a poor
> choice for writeback anyway, and likely to be removed from swap soon.
>
> So it happens to work out just fine in this case. I'd just add a
> comment that explains the recursion deadlock, as well as the
> implication of skipping any busy entry and why that's okay.

Good point, we will indeed skip even if the concurrent insertion from the
swapcache is coming from a different cpu.

As you said, it works out just fine in this case, as the page will be
removed from zswap momentarily anyway. A comment is indeed due.
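
To make the agreed behaviour concrete, here is a rough userspace model of the
idea, not kernel code and untested against the actual patch. Every name in it
(model_swapcache_prepare(), model_read_swap_cache(), model_writeback(), the
skip_if_exists flag, the max_retries bound) is hypothetical and only stands in
for swapcache_prepare(), __read_swap_cache_async() and the zswap writeback
path discussed above. The real retry loop is unbounded; the model bounds it and
returns -EBUSY purely so the "loops forever" case can be observed. Rotating a
skipped entry to the other end of the LRU is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: one flag per swap slot stands in for SWAP_HAS_CACHE. */
#define NR_SLOTS 8
static bool has_cache[NR_SLOTS];

/* Stand-in for swapcache_prepare(): pin the slot, or report it is pinned. */
static int model_swapcache_prepare(size_t slot)
{
	if (has_cache[slot])
		return -EEXIST;
	has_cache[slot] = true;
	return 0;
}

/*
 * Stand-in for the retry loop in __read_swap_cache_async().
 * skip_if_exists is the hypothetical flag proposed in the thread: when set,
 * -EEXIST is returned to the caller instead of being retried.
 * max_retries exists only in this model; exhausting it returns -EBUSY, which
 * represents a task that would otherwise keep looping.
 */
static int model_read_swap_cache(size_t slot, bool skip_if_exists,
				 int max_retries)
{
	for (int i = 0; i <= max_retries; i++) {
		int err = model_swapcache_prepare(slot);

		if (!err)
			return 0;
		if (skip_if_exists)
			return err;
		/* The kernel would sleep briefly here and then retry. */
	}
	return -EBUSY;
}

/*
 * Stand-in for writing back one zswap entry. Returns false when the entry is
 * skipped because someone (possibly this very task) already holds the slot.
 */
static bool model_writeback(size_t slot)
{
	int err = model_read_swap_cache(slot, true, 0);

	if (err == -EEXIST)
		return false;	/* busy entry: skip it, it is a poor candidate */
	has_cache[slot] = false;	/* writeback finished, drop the pin */
	return true;
}

/* Walks through the recursion scenario described in the thread. */
static void model_recursion_scenario(void)
{
	size_t slot = 3;

	/* Swap-in pins the slot first. */
	assert(model_read_swap_cache(slot, false, 0) == 0);
	/* Reclaim re-entering on the same slot without the flag never succeeds. */
	assert(model_read_swap_cache(slot, false, 1000) == -EBUSY);
	/* With the flag, the writeback path gives up on the slot immediately. */
	assert(!model_writeback(slot));
	/* An unrelated, idle slot is still written back normally. */
	assert(model_writeback(slot + 1));
}
```

Note that model_writeback() cannot tell who holds the pin, which is the
non-equivalence pointed out above: it skips any busy slot, whether the pin
comes from the same task or from a concurrent swap-in on another cpu.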