On Tue, Jan 30, 2024 at 10:30:20AM +0800, Chengming Zhou wrote:
> On 2024/1/30 08:22, Yosry Ahmed wrote:
> > On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
> >> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> >>  {
> >>  	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
> >>  	bool *encountered_page_in_swapcache = (bool *)arg;
> >> -	struct zswap_tree *tree;
> >> -	pgoff_t swpoffset;
> >> +	swp_entry_t swpentry;
> >>  	enum lru_status ret = LRU_REMOVED_RETRY;
> >>  	int writeback_result;
> >>  
> >> +	/*
> >> +	 * Rotate the entry to the tail before unlocking the LRU,
> >> +	 * so that in case of an invalidation race concurrent
> >> +	 * reclaimers don't waste their time on it.
> >> +	 *
> >> +	 * If writeback succeeds, or failure is due to the entry
> >> +	 * being invalidated by the swap subsystem, the invalidation
> >> +	 * will unlink and free it.
> >> +	 *
> >> +	 * Temporary failures, where the same entry should be tried
> >> +	 * again immediately, almost never happen for this shrinker.
> >> +	 * We don't do any trylocking; -ENOMEM comes closest,
> >> +	 * but that's extremely rare and doesn't happen spuriously
> >> +	 * either. Don't bother distinguishing this case.
> >> +	 *
> >> +	 * But since they do exist in theory, the entry cannot just
> >> +	 * be unlinked, or we could leak it. Hence, rotate.
> > 
> > The entry cannot be unlinked because we cannot get a ref on it without
> > holding the tree lock, and we cannot deref the tree before we acquire a
> > swap cache ref in zswap_writeback_entry() -- or if
> > zswap_writeback_entry() fails. This should be called out explicitly
> > somewhere. Perhaps we can describe this whole deref dance with added
> > docs to shrink_memcg_cb().
> 
> Maybe we should add some comments before or after zswap_writeback_entry()?
> Or do you have some suggestions? I'm not good at this. :)

I agree with the suggestion of a central point to document this.

How about something like this:

	/*
	 * As soon as we drop the LRU lock, the entry can be freed by
	 * a concurrent invalidation. This means the following:
	 *
	 * 1. We extract the swp_entry_t to the stack, allowing
	 *    zswap_writeback_entry() to pin the swap entry and
	 *    then validate the zswap entry against that swap entry's
	 *    tree using pointer value comparison. Only when that
	 *    is successful can the entry be dereferenced.
	 *
	 * 2. Usually, objects are taken off the LRU for reclaim. In
	 *    this case this isn't possible, because if reclaim fails
	 *    for whatever reason, we have no means of knowing if the
	 *    entry is alive to put it back on the LRU.
	 *
	 *    So rotate it before dropping the lock. If the entry is
	 *    written back or invalidated, the free path will unlink
	 *    it. For failures, rotation is the right thing as well.
	 *
	 *    Temporary failures, where the same entry should be tried
	 *    again immediately, almost never happen for this shrinker.
	 *    We don't do any trylocking; -ENOMEM comes closest,
	 *    but that's extremely rare and doesn't happen spuriously
	 *    either. Don't bother distinguishing this case.
	 */

> > We could also use a comment in zswap_writeback_entry() (or above it) to
> > state that we only deref the tree *after* we get the swapcache ref.
> 
> I just notice there are some comments in zswap_writeback_entry(), should
> we add more comments here?
> 
> 	/*
> 	 * folio is locked, and the swapcache is now secured against
> 	 * concurrent swapping to and from the slot. Verify that the
> 	 * swap entry hasn't been invalidated and recycled behind our
> 	 * backs (our zswap_entry reference doesn't prevent that), to
> 	 * avoid overwriting a new swap folio with old compressed data.
> 	 */

The bit in () is now stale, since we're not even holding a ref ;)

Otherwise, a brief note that the entry cannot be dereferenced until
validation would be helpful in zswap_writeback_entry(). The core of the
scheme I'd probably describe in shrink_memcg_cb(), though.

Can I ask a favor, though? For non-critical updates to this patch, can
you please make them follow-up changes? I just sent out 20 cleanup
patches on top of this patch which would be super painful and error
prone to rebase. I'd like to avoid that if at all possible.
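
PS: In case it helps with placing the comments, here is the condensed
shape of the scheme as I understand it after this patch. This is just a
sketch for discussion, not the literal code: the walk callback
signature, the zswap_trees[] lookup and all of the error handling,
cleanup and refcounting are abbreviated from memory and may not match
the final patch exactly.

	/* Condensed from mm/zswap.c; error paths and cleanup elided. */
	static int zswap_writeback_entry(struct zswap_entry *entry,
					 swp_entry_t swpentry)
	{
		struct zswap_tree *tree = zswap_trees[swp_type(swpentry)];

		/* __read_swap_cache_async() pins the slot with a locked folio ... */

		/*
		 * The slot can no longer be invalidated and recycled
		 * behind our backs. Validate @entry by pointer comparison
		 * against what the tree holds for this offset; only on a
		 * match may @entry be dereferenced.
		 */
		spin_lock(&tree->lock);
		if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
			spin_unlock(&tree->lock);
			return -ENOMEM;	/* raced with invalidation */
		}
		spin_unlock(&tree->lock);

		/* ... @entry is now safe to use for decompression and writeback ... */
		return 0;
	}

	static enum lru_status shrink_memcg_cb(struct list_head *item,
					       struct list_lru_one *l,
					       spinlock_t *lock, void *arg)
	{
		struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
		swp_entry_t swpentry;

		/* Rotate first, so a failed attempt doesn't lose the entry. */
		list_move_tail(item, &l->list);

		/*
		 * Copy the swap entry to the stack. Once the LRU lock is
		 * dropped, a concurrent invalidation can free @entry, so
		 * it must not be dereferenced until revalidated above.
		 */
		swpentry = entry->swpentry;
		spin_unlock(lock);

		zswap_writeback_entry(entry, swpentry);

		spin_lock(lock);
		return LRU_REMOVED_RETRY;
	}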