On 2024/1/30 08:22, Yosry Ahmed wrote:
> On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
>> LRU writeback has a race problem with swapoff, as spotted by Yosry [1]:
>>
>>         CPU1                        CPU2
>>         shrink_memcg_cb             swap_off
>>           list_lru_isolate            zswap_invalidate
>>                                       zswap_swapoff
>>                                         kfree(tree)
>>           // UAF
>>           spin_lock(&tree->lock)
>>
>> The problem is that the entry on the lru list can't protect the tree
>> from being freed by swapoff, and the entry can also be invalidated and
>> freed concurrently after we unlock the lru lock.
>>
>> We can fix it by moving the swap cache allocation ahead, before
>> referencing the tree, then checking for the invalidate race with the
>> tree lock; only after that can we safely deref the entry. Note that we
>> can't deref the entry or the tree anymore after we unlock the folio,
>> since we depend on the locked folio to hold off swapoff.
>>
>> So this patch moves all tree and entry usage into zswap_writeback_entry();
>> we only use the copied swpentry on the stack to allocate the swap cache,
>> and if it returns with the folio locked we can reference the tree safely.
>> Then we can check for the invalidate race with the tree lock; the rest
>> is much the same as zswap_load().
>>
>> Since we can't deref the entry after zswap_writeback_entry(), we can't
>> use zswap_lru_putback() anymore; instead we rotate the entry at the
>> beginning. It will be unlinked and freed when invalidated, if writeback
>> succeeds.
>
> You are also removing one redundant lookup from the zswap writeback path
> to check for the invalidation race, and simplifying the code. Give
> yourself full credit :)

Ah right, I forgot to mention it. I will add this part to the commit message.
Thanks for the reminder!
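
For reference, the shrink_memcg_cb() side of the new ordering described
above boils down to roughly the following. This is only a simplified
sketch of the flow (stats and the encountered_page_in_swapcache handling
are trimmed, and details may differ from the actual hunk):

static enum lru_status shrink_memcg_cb(struct list_head *item,
                                       struct list_lru_one *l,
                                       spinlock_t *lock, void *arg)
{
        struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
        swp_entry_t swpentry;
        enum lru_status ret = LRU_REMOVED_RETRY;
        int writeback_result;

        /* Rotate to the LRU tail first, while the lru lock is still held. */
        list_move_tail(item, &l->list);

        /*
         * Copy the swp_entry_t to the stack. Once the lru lock is
         * dropped, the entry may be invalidated and freed at any
         * time, so it must not be dereferenced here anymore.
         */
        swpentry = entry->swpentry;

        /* Writeback runs without the lru lock held. */
        spin_unlock(lock);

        /* entry is passed only for the identity check under the tree lock. */
        writeback_result = zswap_writeback_entry(entry, swpentry);
        if (writeback_result)
                ret = LRU_RETRY;

        spin_lock(lock);
        return ret;
}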
>
>>
>> Another change is that we don't update the memcg nr_zswap_protected in
>> the -ENOMEM and -EEXIST cases anymore. The -EEXIST case means we raced
>> with swapin or with a concurrent shrinker action; since swapin has
>> already updated the memcg nr_zswap_protected, we don't need to double
>> count here. For a concurrent shrinker, the folio will be written back
>> and freed anyway. The -ENOMEM case is extremely rare and doesn't happen
>> spuriously either, so don't bother distinguishing this case.
>>
>> [1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@xxxxxxxxxxxxxx/
>>
>> Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
>> Acked-by: Nhat Pham <nphamcs@xxxxxxxxx>
>> Signed-off-by: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx>
>> ---
>>  mm/zswap.c | 114 ++++++++++++++++++++++++++-----------------------------
>>  1 file changed, 49 insertions(+), 65 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 81cb3790e0dd..f5cb5a46e4d7 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -277,7 +277,7 @@ static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
>>  		 zpool_get_type((p)->zpools[0]))
>>
>>  static int zswap_writeback_entry(struct zswap_entry *entry,
>> -				 struct zswap_tree *tree);
>> +				 swp_entry_t swpentry);
>>  static int zswap_pool_get(struct zswap_pool *pool);
>>  static void zswap_pool_put(struct zswap_pool *pool);
>>
>> @@ -445,27 +445,6 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
>>  	rcu_read_unlock();
>>  }
>>
>> -static void zswap_lru_putback(struct list_lru *list_lru,
>> -		struct zswap_entry *entry)
>> -{
>> -	int nid = entry_to_nid(entry);
>> -	spinlock_t *lock = &list_lru->node[nid].lock;
>> -	struct mem_cgroup *memcg;
>> -	struct lruvec *lruvec;
>> -
>> -	rcu_read_lock();
>> -	memcg = mem_cgroup_from_entry(entry);
>> -	spin_lock(lock);
>> -	/* we cannot use list_lru_add here, because it increments node's lru count */
>> -	list_lru_putback(list_lru, &entry->lru, nid, memcg);
>> -	spin_unlock(lock);
>> -
>> -	lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
>> -	/* increment the protection area to account for the LRU rotation. */
>> -	atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
>> -	rcu_read_unlock();
>> -}
>> -
>>  /*********************************
>>  * rbtree functions
>>  **********************************/
>> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>>  {
>>  	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>>  	bool *encountered_page_in_swapcache = (bool *)arg;
>> -	struct zswap_tree *tree;
>> -	pgoff_t swpoffset;
>> +	swp_entry_t swpentry;
>>  	enum lru_status ret = LRU_REMOVED_RETRY;
>>  	int writeback_result;
>>
>> +	/*
>> +	 * Rotate the entry to the tail before unlocking the LRU,
>> +	 * so that in case of an invalidation race concurrent
>> +	 * reclaimers don't waste their time on it.
>> +	 *
>> +	 * If writeback succeeds, or failure is due to the entry
>> +	 * being invalidated by the swap subsystem, the invalidation
>> +	 * will unlink and free it.
>> +	 *
>> +	 * Temporary failures, where the same entry should be tried
>> +	 * again immediately, almost never happen for this shrinker.
>> +	 * We don't do any trylocking; -ENOMEM comes closest,
>> +	 * but that's extremely rare and doesn't happen spuriously
>> +	 * either. Don't bother distinguishing this case.
>> +	 *
>> +	 * But since they do exist in theory, the entry cannot just
>> +	 * be unlinked, or we could leak it. Hence, rotate.
>
> The entry cannot be unlinked because we cannot get a ref on it without
> holding the tree lock, and we cannot deref the tree before we acquire a
> swap cache ref in zswap_writeback_entry() -- or if
> zswap_writeback_entry() fails. This should be called out explicitly
> somewhere. Perhaps we can describe this whole deref dance with added
> docs to shrink_memcg_cb().

Maybe we should add some comments before or after zswap_writeback_entry()?
Or do you have any suggestions? I'm not good at this. :)
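
That said, here is a very rough attempt at a comment above
zswap_writeback_entry(), just so we have something concrete to discuss
(only a sketch, the wording surely needs work):

/*
 * zswap_writeback_entry() gets only a swp_entry_t that was copied to
 * the stack while the LRU lock was still held; the zswap_entry itself
 * may be invalidated and freed at any time after that lock is dropped.
 * Neither the entry nor the tree may be dereferenced here until the
 * swapcache folio has been allocated and locked: the locked swapcache
 * folio is what holds off swapoff (and thus keeps the tree alive), and
 * the tree lock must then be taken to verify that the entry is still
 * in the tree before it is dereferenced.
 */
static int zswap_writeback_entry(struct zswap_entry *entry,
                                 swp_entry_t swpentry)

And maybe a short pointer to it from shrink_memcg_cb(), since that is
where the swpentry copy and the rotation happen.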
>
> We could also use a comment in zswap_writeback_entry() (or above it) to
> state that we only deref the tree *after* we get the swapcache ref.

I just noticed there are already some comments in zswap_writeback_entry();
should we add more comments here?

	/*
	 * folio is locked, and the swapcache is now secured against
	 * concurrent swapping to and from the slot. Verify that the
	 * swap entry hasn't been invalidated and recycled behind our
	 * backs (our zswap_entry reference doesn't prevent that), to
	 * avoid overwriting a new swap folio with old compressed data.
	 */
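
For context, that comment sits right above the tree-lock check, which
with this patch looks roughly like the below (a simplified sketch using
the locals of zswap_writeback_entry(); the exact error handling and
return value may differ):

        struct zswap_tree *tree = swap_zswap_tree(swpentry);

        /* Check for an invalidate() race now that the folio is locked. */
        spin_lock(&tree->lock);
        if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
                /* Entry was invalidated (and maybe recycled): bail out. */
                spin_unlock(&tree->lock);
                delete_from_swap_cache(folio);
                folio_unlock(folio);
                folio_put(folio);
                return -ENOMEM;
        }
        spin_unlock(&tree->lock);

So the safe order is: lock the folio in the swapcache first, then deref
the tree, then verify the entry under the tree lock before using it.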