Re: [PATCH v2 2/3] mm/zswap: fix race between lru writeback and swapoff

On Tue, Jan 30, 2024 at 10:30:20AM +0800, Chengming Zhou wrote:
> On 2024/1/30 08:22, Yosry Ahmed wrote:
> > On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
> >> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> >>  {
> >>  	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
> >>  	bool *encountered_page_in_swapcache = (bool *)arg;
> >> -	struct zswap_tree *tree;
> >> -	pgoff_t swpoffset;
> >> +	swp_entry_t swpentry;
> >>  	enum lru_status ret = LRU_REMOVED_RETRY;
> >>  	int writeback_result;
> >>  
> >> +	/*
> >> +	 * Rotate the entry to the tail before unlocking the LRU,
> >> +	 * so that in case of an invalidation race concurrent
> >> +	 * reclaimers don't waste their time on it.
> >> +	 *
> >> +	 * If writeback succeeds, or failure is due to the entry
> >> +	 * being invalidated by the swap subsystem, the invalidation
> >> +	 * will unlink and free it.
> >> +	 *
> >> +	 * Temporary failures, where the same entry should be tried
> >> +	 * again immediately, almost never happen for this shrinker.
> >> +	 * We don't do any trylocking; -ENOMEM comes closest,
> >> +	 * but that's extremely rare and doesn't happen spuriously
> >> +	 * either. Don't bother distinguishing this case.
> >> +	 *
> >> +	 * But since they do exist in theory, the entry cannot just
> >> +	 * be unlinked, or we could leak it. Hence, rotate.
> > 
> > The entry cannot be unlinked because we cannot get a ref on it without
> > holding the tree lock, and we cannot deref the tree before we acquire a
> > swap cache ref in zswap_writeback_entry() -- or if
> > zswap_writeback_entry() fails. This should be called out explicitly
> > somewhere. Perhaps we can describe this whole deref dance with added
> > docs to shrink_memcg_cb().
> 
> Maybe we should add some comments before or after zswap_writeback_entry()?
> Or do you have some suggestions? I'm not good at this. :)

I agree with the suggestion of a central point to document this.

How about something like this:

/*
 * As soon as we drop the LRU lock, the entry can be freed by
 * a concurrent invalidation. This means the following:
 *
 * 1. We extract the swp_entry_t to the stack, allowing
 *    zswap_writeback_entry() to pin the swap entry and
 *    then validate the zswap entry against that swap entry's
 *    tree using pointer value comparison. Only when that
 *    is successful can the entry be dereferenced.
 *
 * 2. Usually, objects are taken off the LRU for reclaim. In
 *    this case this isn't possible, because if reclaim fails
 *    for whatever reason, we have no means of knowing if the
 *    entry is alive to put it back on the LRU.
 *
 *    So rotate it before dropping the lock. If the entry is
 *    written back or invalidated, the free path will unlink
 *    it. For failures, rotation is the right thing as well.
 *
 *    Temporary failures, where the same entry should be tried
 *    again immediately, almost never happen for this shrinker.
 *    We don't do any trylocking; -ENOMEM comes closest,
 *    but that's extremely rare and doesn't happen spuriously
 *    either. Don't bother distinguishing this case.
 */

> > We could also use a comment in zswap_writeback_entry() (or above it) to
> > state that we only deref the tree *after* we get the swapcache ref.
> 
> I just notice there are some comments in zswap_writeback_entry(), should
> we add more comments here?
> 
> 	/*
> 	 * folio is locked, and the swapcache is now secured against
> 	 * concurrent swapping to and from the slot. Verify that the
> 	 * swap entry hasn't been invalidated and recycled behind our
> 	 * backs (our zswap_entry reference doesn't prevent that), to
> 	 * avoid overwriting a new swap folio with old compressed data.
> 	 */

The bit in () is now stale, since we're not even holding a ref ;)

Otherwise, a brief note that the entry cannot be dereferenced until
validation would be helpful in zswap_writeback_entry(). The core of
the scheme I'd probably describe in shrink_memcg_cb(), though.

Can I ask a favor, though?

For non-critical updates to this patch, can you please make them
follow-up changes? I just sent out 20 cleanup patches on top of this
patch which would be super painful and error-prone to rebase. I'd like
to avoid that if at all possible.



