On 2024/1/30 11:17, Johannes Weiner wrote: > On Tue, Jan 30, 2024 at 10:30:20AM +0800, Chengming Zhou wrote: >> On 2024/1/30 08:22, Yosry Ahmed wrote: >>> On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote: >>>> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o >>>> { >>>> struct zswap_entry *entry = container_of(item, struct zswap_entry, lru); >>>> bool *encountered_page_in_swapcache = (bool *)arg; >>>> - struct zswap_tree *tree; >>>> - pgoff_t swpoffset; >>>> + swp_entry_t swpentry; >>>> enum lru_status ret = LRU_REMOVED_RETRY; >>>> int writeback_result; >>>> >>>> + /* >>>> + * Rotate the entry to the tail before unlocking the LRU, >>>> + * so that in case of an invalidation race concurrent >>>> + * reclaimers don't waste their time on it. >>>> + * >>>> + * If writeback succeeds, or failure is due to the entry >>>> + * being invalidated by the swap subsystem, the invalidation >>>> + * will unlink and free it. >>>> + * >>>> + * Temporary failures, where the same entry should be tried >>>> + * again immediately, almost never happen for this shrinker. >>>> + * We don't do any trylocking; -ENOMEM comes closest, >>>> + * but that's extremely rare and doesn't happen spuriously >>>> + * either. Don't bother distinguishing this case. >>>> + * >>>> + * But since they do exist in theory, the entry cannot just >>>> + * be unlinked, or we could leak it. Hence, rotate. >>> >>> The entry cannot be unlinked because we cannot get a ref on it without >>> holding the tree lock, and we cannot deref the tree before we acquire a >>> swap cache ref in zswap_writeback_entry() -- or if >>> zswap_writeback_entry() fails. This should be called out explicitly >>> somewhere. Perhaps we can describe this whole deref dance with added >>> docs to shrink_memcg_cb(). >> >> Maybe we should add some comments before or after zswap_writeback_entry()? >> Or do you have some suggestions? I'm not good at this. :) > > I agree with the suggestion of a central point to document this. > > How about something like this: > > /* > * As soon as we drop the LRU lock, the entry can be freed by > * a concurrent invalidation. This means the following: > * > * 1. We extract the swp_entry_t to the stack, allowing > * zswap_writeback_entry() to pin the swap entry and > * then validate the zwap entry against that swap entry's > * tree using pointer value comparison. Only when that > * is successful can the entry be dereferenced. > * > * 2. Usually, objects are taken off the LRU for reclaim. In > * this case this isn't possible, because if reclaim fails > * for whatever reason, we have no means of knowing if the > * entry is alive to put it back on the LRU. > * > * So rotate it before dropping the lock. If the entry is > * written back or invalidated, the free path will unlink > * it. For failures, rotation is the right thing as well. > * > * Temporary failures, where the same entry should be tried > * again immediately, almost never happen for this shrinker. > * We don't do any trylocking; -ENOMEM comes closest, > * but that's extremely rare and doesn't happen spuriously > * either. Don't bother distinguishing this case. > */ > Thanks! I think this document is great enough. >>> We could also use a comment in zswap_writeback_entry() (or above it) to >>> state that we only deref the tree *after* we get the swapcache ref. >> >> I just notice there are some comments in zswap_writeback_entry(), should >> we add more comments here? >> >> /* >> * folio is locked, and the swapcache is now secured against >> * concurrent swapping to and from the slot. Verify that the >> * swap entry hasn't been invalidated and recycled behind our >> * backs (our zswap_entry reference doesn't prevent that), to >> * avoid overwriting a new swap folio with old compressed data. >> */ > > The bit in () is now stale, since we're not even holding a ref ;) Right. > > Otherwise, a brief note that entry can not be dereferenced until > validation would be helpful in zswap_writeback_entry(). The core of Ok. > the scheme I'd probably describe in shrink_memcg_cb(), though. > > Can I ask a favor, though? > > For non-critical updates to this patch, can you please make them > follow-up changes? I just sent out 20 cleanup patches on top of this > patch which would be super painful and error prone to rebase. I'd like > to avoid that if at all possible. Ok, so these comments changes should be changed on top of your cleanup series and sent as a follow-up patch. Thanks.