Re: [PATCH v7] zswap: replace RB tree with xarray

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Wed, 20 Mar 2024 20:12:27 +0000

On Wed, Mar 20, 2024 at 04:03:22PM -0400, Johannes Weiner wrote:
> On Wed, Mar 20, 2024 at 07:34:36PM +0000, Yosry Ahmed wrote:
> > On Wed, Mar 20, 2024 at 03:25:58PM -0400, Johannes Weiner wrote:
> > > On Wed, Mar 20, 2024 at 07:11:38PM +0000, Yosry Ahmed wrote:
> > > > On Wed, Mar 20, 2024 at 06:08:03AM -0400, Johannes Weiner wrote:
> > > > > On Wed, Mar 20, 2024 at 07:24:27AM +0000, Yosry Ahmed wrote:
> > > > > > [..]
> > > > > > > > > -     /* map */
> > > > > > > > > -     spin_lock(&tree->lock);
> > > > > > > > >       /*
> > > > > > > > > -      * The folio may have been dirtied again, invalidate the
> > > > > > > > > -      * possibly stale entry before inserting the new entry.
> > > > > > > > > +      * We finish initializing the entry while it's already in xarray.
> > > > > > > > > +      * This is safe because:
> > > > > > > > > +      *
> > > > > > > > > +      * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > > > > > > > +      *
> > > > > > > > > +      * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > > > > > > > +      *    The publishing order matters to prevent writeback from seeing
> > > > > > > > > +      *    an incoherent entry.
> > > > > > > >
> > > > > > > > As I mentioned before, writeback is also protected by the folio lock.
> > > > > > > > Concurrent writeback will find the folio in the swapcache and abort. The
> > > > > > > > fact that the entry is not on the LRU yet is just additional protection,
> > > > > > > > so I don't think the publishing order actually matters here. Right?
> > > > > > > 
> > > > > > > Right. This comment is explaining why this publishing order does not
> > > > > > > matter. I think we are talking about the same thing here?
> > > > > > 
> > > > > > The comment literally says "the publishing order matters.." :)
> > > > > > 
> > > > > > I believe Johannes meant that we should only publish the entry to the
> > > > > > LRU once it is fully initialized, to prevent writeback from using a
> > > > > > partially initialized entry.
> > > > > > 
> > > > > > What I am saying is that, even if we add a partially initialized entry
> > > > > > to the zswap LRU, writeback will skip it anyway because the folio is
> > > > > > locked in the swapcache.
> > > > > > 
> > > > > > So basically I think the comment should say:
> > > > > > 
> > > > > > 	/*
> > > > > > 	 * We finish initializing the entry while it's already in the
> > > > > > 	 * xarray. This is safe because the folio is locked in the swap
> > > > > > 	 * cache, which should protect against concurrent stores,
> > > > > > 	 * invalidations, and writeback.
> > > > > > 	 */
> > > > > > 
> > > > > > Johannes, what do you think?
> > > > > 
> > > > > I don't think that's quite right.
> > > > > 
> > > > > Writeback will bail on swapcache insert, yes, but it will access the
> > > > > entry before attempting it. If LRU publishing happened before setting
> > > > > entry->swpentry e.g., we'd have a problem, while your comment suggets
> > > > > it would be safe to rearrange the code like this.
> > > > > 
> > > > > So LRU publishing order does matter.
> > > > 
> > > > Ah yes, you are right. entry->swpentry should be set to make sure we
> > > > lookup the correct entry in the swapcache and the tree.
> > > > 
> > > > Perhaps we should spell this out in the comment and make the
> > > > initialization ordering more explicit? Maybe something like:
> > > > 
> > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > index d8a14b27adcd7..70924b437743a 100644
> > > > --- a/mm/zswap.c
> > > > +++ b/mm/zswap.c
> > > > @@ -1472,9 +1472,6 @@ bool zswap_store(struct folio *folio)
> > > >  		goto put_pool;
> > > >  
> > > >  insert_entry:
> > > > -	entry->swpentry = swp;
> > > > -	entry->objcg = objcg;
> > > > -
> > > >  	old = xa_store(tree, offset, entry, GFP_KERNEL);
> > > >  	if (xa_is_err(old)) {
> > > >  		int err = xa_err(old);
> > > > @@ -1491,6 +1488,7 @@ bool zswap_store(struct folio *folio)
> > > >  	if (old)
> > > >  		zswap_entry_free(old);
> > > >  
> > > > +	entry->objcg = objcg;
> > > >  	if (objcg) {
> > > >  		obj_cgroup_charge_zswap(objcg, entry->length);
> > > >  		count_objcg_event(objcg, ZSWPOUT);
> > > > @@ -1498,15 +1496,16 @@ bool zswap_store(struct folio *folio)
> > > >  
> > > >  	/*
> > > >  	 * We finish initializing the entry while it's already in xarray.
> > > > -	 * This is safe because:
> > > > -	 *
> > > > -	 * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > > +	 * This is safe because the folio is locked in the swapcache, which
> > > > +	 * protects against concurrent stores and invalidations.
> > > >  	 *
> > > > -	 * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > > -	 *    The publishing order matters to prevent writeback from seeing
> > > > -	 *    an incoherent entry.
> > > > +	 * Concurrent writeback is not possible until we add the entry to the
> > > > +	 * LRU. We need to at least initialize entry->swpentry *before* adding
> > > > +	 * the entry to the LRU to make sure writeback looks up the correct
> > > > +	 * entry in the swapcache.
> > > >  	 */
> > > >  	if (entry->length) {
> > > > +		entry->swpentry = swp;
> > > >  		INIT_LIST_HEAD(&entry->lru);
> > > >  		zswap_lru_add(&zswap_list_lru, entry);
> > > >  		atomic_inc(&zswap_nr_stored);
> > > > 
> > > > 
> > > > This also got me wondering, do we need a write barrier between
> > > > initializing entry->swpentry and zswap_lru_add()?
> > > > 
> > > > I guess if we read the wrong swpentry in zswap_writeback_entry() we will
> > > > eventually fail the xa_cmpxchg() and drop it anyway, but it seems
> > > > bug-prone.
> > > 
> > > I think it's more robust the way Chris has it now. Writeback only
> > > derefs ->swpentry today, but who knows if somebody wants to make a
> > > changes that relies on a different member. Having submembers follow
> > > different validity rules and timelines is error prone and makes the
> > > code less hackable without buying all that much. The concept of
> > > "publishing" an object like this is more common: if you can see it,
> > > you can expect it to be coherent.
> > 
> > Fair enough, but don't we still need a barrier there? Couldn't some
> > initializations still be reorder after zswap_lru_add()?
> 
> Only if it were lockless. The LRU unlocking in zswap_store() implies
> RELEASE, the LRU locking in writeback implies ACQUIRE. Those force the
> desired ordering - nothing can bleed after RELEASE, nothing can bleed
> before ACQUIRE.

Ah yes, I found smp_store_release() deep in the spin_unlock() code.
Thanks for pointing this out, all is good in the world.