On Wed, Mar 20, 2024 at 04:03:22PM -0400, Johannes Weiner wrote: > On Wed, Mar 20, 2024 at 07:34:36PM +0000, Yosry Ahmed wrote: > > On Wed, Mar 20, 2024 at 03:25:58PM -0400, Johannes Weiner wrote: > > > On Wed, Mar 20, 2024 at 07:11:38PM +0000, Yosry Ahmed wrote: > > > > On Wed, Mar 20, 2024 at 06:08:03AM -0400, Johannes Weiner wrote: > > > > > On Wed, Mar 20, 2024 at 07:24:27AM +0000, Yosry Ahmed wrote: > > > > > > [..] > > > > > > > > > - /* map */ > > > > > > > > > - spin_lock(&tree->lock); > > > > > > > > > /* > > > > > > > > > - * The folio may have been dirtied again, invalidate the > > > > > > > > > - * possibly stale entry before inserting the new entry. > > > > > > > > > + * We finish initializing the entry while it's already in xarray. > > > > > > > > > + * This is safe because: > > > > > > > > > + * > > > > > > > > > + * 1. Concurrent stores and invalidations are excluded by folio lock. > > > > > > > > > + * > > > > > > > > > + * 2. Writeback is excluded by the entry not being on the LRU yet. > > > > > > > > > + * The publishing order matters to prevent writeback from seeing > > > > > > > > > + * an incoherent entry. > > > > > > > > > > > > > > > > As I mentioned before, writeback is also protected by the folio lock. > > > > > > > > Concurrent writeback will find the folio in the swapcache and abort. The > > > > > > > > fact that the entry is not on the LRU yet is just additional protection, > > > > > > > > so I don't think the publishing order actually matters here. Right? > > > > > > > > > > > > > > Right. This comment is explaining why this publishing order does not > > > > > > > matter. I think we are talking about the same thing here? > > > > > > > > > > > > The comment literally says "the publishing order matters.." :) > > > > > > > > > > > > I believe Johannes meant that we should only publish the entry to the > > > > > > LRU once it is fully initialized, to prevent writeback from using a > > > > > > partially initialized entry. > > > > > > > > > > > > What I am saying is that, even if we add a partially initialized entry > > > > > > to the zswap LRU, writeback will skip it anyway because the folio is > > > > > > locked in the swapcache. > > > > > > > > > > > > So basically I think the comment should say: > > > > > > > > > > > > /* > > > > > > * We finish initializing the entry while it's already in the > > > > > > * xarray. This is safe because the folio is locked in the swap > > > > > > * cache, which should protect against concurrent stores, > > > > > > * invalidations, and writeback. > > > > > > */ > > > > > > > > > > > > Johannes, what do you think? > > > > > > > > > > I don't think that's quite right. > > > > > > > > > > Writeback will bail on swapcache insert, yes, but it will access the > > > > > entry before attempting it. If LRU publishing happened before setting > > > > > entry->swpentry e.g., we'd have a problem, while your comment suggets > > > > > it would be safe to rearrange the code like this. > > > > > > > > > > So LRU publishing order does matter. > > > > > > > > Ah yes, you are right. entry->swpentry should be set to make sure we > > > > lookup the correct entry in the swapcache and the tree. > > > > > > > > Perhaps we should spell this out in the comment and make the > > > > initialization ordering more explicit? Maybe something like: > > > > > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > > index d8a14b27adcd7..70924b437743a 100644 > > > > --- a/mm/zswap.c > > > > +++ b/mm/zswap.c > > > > @@ -1472,9 +1472,6 @@ bool zswap_store(struct folio *folio) > > > > goto put_pool; > > > > > > > > insert_entry: > > > > - entry->swpentry = swp; > > > > - entry->objcg = objcg; > > > > - > > > > old = xa_store(tree, offset, entry, GFP_KERNEL); > > > > if (xa_is_err(old)) { > > > > int err = xa_err(old); > > > > @@ -1491,6 +1488,7 @@ bool zswap_store(struct folio *folio) > > > > if (old) > > > > zswap_entry_free(old); > > > > > > > > + entry->objcg = objcg; > > > > if (objcg) { > > > > obj_cgroup_charge_zswap(objcg, entry->length); > > > > count_objcg_event(objcg, ZSWPOUT); > > > > @@ -1498,15 +1496,16 @@ bool zswap_store(struct folio *folio) > > > > > > > > /* > > > > * We finish initializing the entry while it's already in xarray. > > > > - * This is safe because: > > > > - * > > > > - * 1. Concurrent stores and invalidations are excluded by folio lock. > > > > + * This is safe because the folio is locked in the swapcache, which > > > > + * protects against concurrent stores and invalidations. > > > > * > > > > - * 2. Writeback is excluded by the entry not being on the LRU yet. > > > > - * The publishing order matters to prevent writeback from seeing > > > > - * an incoherent entry. > > > > + * Concurrent writeback is not possible until we add the entry to the > > > > + * LRU. We need to at least initialize entry->swpentry *before* adding > > > > + * the entry to the LRU to make sure writeback looks up the correct > > > > + * entry in the swapcache. > > > > */ > > > > if (entry->length) { > > > > + entry->swpentry = swp; > > > > INIT_LIST_HEAD(&entry->lru); > > > > zswap_lru_add(&zswap_list_lru, entry); > > > > atomic_inc(&zswap_nr_stored); > > > > > > > > > > > > This also got me wondering, do we need a write barrier between > > > > initializing entry->swpentry and zswap_lru_add()? > > > > > > > > I guess if we read the wrong swpentry in zswap_writeback_entry() we will > > > > eventually fail the xa_cmpxchg() and drop it anyway, but it seems > > > > bug-prone. > > > > > > I think it's more robust the way Chris has it now. Writeback only > > > derefs ->swpentry today, but who knows if somebody wants to make a > > > changes that relies on a different member. Having submembers follow > > > different validity rules and timelines is error prone and makes the > > > code less hackable without buying all that much. The concept of > > > "publishing" an object like this is more common: if you can see it, > > > you can expect it to be coherent. > > > > Fair enough, but don't we still need a barrier there? Couldn't some > > initializations still be reorder after zswap_lru_add()? > > Only if it were lockless. The LRU unlocking in zswap_store() implies > RELEASE, the LRU locking in writeback implies ACQUIRE. Those force the > desired ordering - nothing can bleed after RELEASE, nothing can bleed > before ACQUIRE. Ah yes, I found smp_store_release() deep in the spin_unlock() code. Thanks for pointing this out, all is good in the world.