On Fri, Jan 26, 2024 at 12:32 AM <chengming.zhou@xxxxxxxxx> wrote:
>
> From: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx>
>
> LRU writeback has a race problem with swapoff, as spotted by Yosry[1]:
>
>        CPU1                           CPU2
>        shrink_memcg_cb                swap_off
>          list_lru_isolate               zswap_invalidate
>                                           zswap_swapoff
>                                             kfree(tree)
>          // UAF
>          spin_lock(&tree->lock)
>
> The problem is that the entry in the lru list can't protect the tree
> from being freed by swapoff, and the entry can also be invalidated and
> freed concurrently after we unlock the lru lock.
>
> We can fix it by moving the swap cache allocation ahead of referencing
> the tree, then checking for the invalidate race under the tree lock;
> only after that can we safely deref the entry. Note that we can't deref
> the entry or the tree anymore after we unlock the folio, since we
> depend on the folio lock to hold off swapoff.
>
> So this patch moves all tree and entry usage into
> zswap_writeback_entry(); we only use the copied swpentry on the stack
> to allocate the swap cache and return with the folio locked, after
> which we can reference the tree. Then we check for the invalidate race
> under the tree lock; the rest is much the same as zswap_load().
>
> Since we can't deref the entry after zswap_writeback_entry(), we can't
> use zswap_lru_putback() anymore; instead we rotate the entry

I added list_lru_putback to the list_lru API specifically for this use
case (zswap_lru_putback()). Now that we no longer need it, maybe we can
remove it as well (assuming no one else is using it?).

This can be done in a separate patch though.

> in the LRU list, so concurrent reclaimers have little chance to see it.
> Or it will be deleted from the LRU list if writeback succeeds.
>
> Another confusing part to me is the update of memcg nr_zswap_protected
> in zswap_lru_putback(). I'm not sure why it's needed there, since if we
> raced with swapin, memcg nr_zswap_protected has already been updated in
> zswap_folio_swapin(). So that part is not included for now.
>
> [1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@xxxxxxxxxxxxxx/
>
> Signed-off-by: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx>

LGTM! This is quite elegant.
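Just to spell out the ordering for anyone else reading along -- a
simplified sketch of the flow after this patch, not the literal code
(argument lists and cleanup paths trimmed):

	/* shrink_memcg_cb(), under the lru lock */
	list_move_tail(item, &l->list);         /* rotate instead of isolate */
	swpentry = entry->swpentry;             /* copy to the stack */
	spin_unlock(lock);                      /* entry may be freed from here on */

	/* zswap_writeback_entry(), using only the stack copy */
	folio = __read_swap_cache_async(swpentry, ...);
	if (!folio_was_allocated)
		return -EEXIST;                 /* lost the race for this swap entry */

	tree = swap_zswap_tree(swpentry);       /* the locked folio holds off swapoff */
	spin_lock(&tree->lock);
	if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry)
		goto fail;                      /* raced with invalidate(), clean up and bail */
	zswap_entry_get(entry);                 /* only now is entry safe to deref */
	spin_unlock(&tree->lock);

The only thing carried across the lru unlock is the swp_entry_t value
itself, which is what makes this safe against both swapoff and
invalidation.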
Acked-by: Nhat Pham <nphamcs@xxxxxxxxx>

> ---
>  mm/zswap.c | 93 ++++++++++++++++++------------------------------------
>  1 file changed, 31 insertions(+), 62 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 81cb3790e0dd..fa2bdb7ec1d8 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -277,7 +277,7 @@ static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
>  		zpool_get_type((p)->zpools[0]))
>
>  static int zswap_writeback_entry(struct zswap_entry *entry,
> -				 struct zswap_tree *tree);
> +				 swp_entry_t swpentry);
>  static int zswap_pool_get(struct zswap_pool *pool);
>  static void zswap_pool_put(struct zswap_pool *pool);
>
> @@ -445,27 +445,6 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
>  	rcu_read_unlock();
>  }
>
> -static void zswap_lru_putback(struct list_lru *list_lru,
> -			      struct zswap_entry *entry)
> -{
> -	int nid = entry_to_nid(entry);
> -	spinlock_t *lock = &list_lru->node[nid].lock;
> -	struct mem_cgroup *memcg;
> -	struct lruvec *lruvec;
> -
> -	rcu_read_lock();
> -	memcg = mem_cgroup_from_entry(entry);
> -	spin_lock(lock);
> -	/* we cannot use list_lru_add here, because it increments node's lru count */
> -	list_lru_putback(list_lru, &entry->lru, nid, memcg);
> -	spin_unlock(lock);
> -
> -	lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
> -	/* increment the protection area to account for the LRU rotation. */
> -	atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> -	rcu_read_unlock();
> -}
> -
>  /*********************************
>  * rbtree functions
>  **********************************/
> @@ -860,40 +839,34 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>  {
>  	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>  	bool *encountered_page_in_swapcache = (bool *)arg;
> -	struct zswap_tree *tree;
> -	pgoff_t swpoffset;
> +	swp_entry_t swpentry;
>  	enum lru_status ret = LRU_REMOVED_RETRY;
>  	int writeback_result;
>
> +	/*
> +	 * First rotate to the tail of lru list before unlocking lru lock,
> +	 * so the concurrent reclaimers have little chance to see it.
> +	 * It will be deleted from the lru list if writeback success.
> +	 */
> +	list_move_tail(item, &l->list);
> +
>  	/*
>  	 * Once the lru lock is dropped, the entry might get freed. The
> -	 * swpoffset is copied to the stack, and entry isn't deref'd again
> +	 * swpentry is copied to the stack, and entry isn't deref'd again
>  	 * until the entry is verified to still be alive in the tree.
>  	 */
> -	swpoffset = swp_offset(entry->swpentry);
> -	tree = swap_zswap_tree(entry->swpentry);
> -	list_lru_isolate(l, item);

nit: IIUC, now that we're no longer removing the entry from the
list_lru, we protect against concurrent shrinking action via this
check inside zswap_writeback_entry() too, right:

	if (!folio_was_allocated) {
		folio_put(folio);
		return -EEXIST;
	}

Maybe update the comment above it too? (I put a strawman wording below
the quoted patch.)

> +	swpentry = entry->swpentry;
> +
>  	/*
>  	 * It's safe to drop the lock here because we return either
>  	 * LRU_REMOVED_RETRY or LRU_RETRY.
>  	 */
>  	spin_unlock(lock);
>
> -	/* Check for invalidate() race */
> -	spin_lock(&tree->lock);
> -	if (entry != zswap_rb_search(&tree->rbroot, swpoffset))
> -		goto unlock;
> -
> -	/* Hold a reference to prevent a free during writeback */
> -	zswap_entry_get(entry);
> -	spin_unlock(&tree->lock);
> -
> -	writeback_result = zswap_writeback_entry(entry, tree);
> +	writeback_result = zswap_writeback_entry(entry, swpentry);
> -	spin_lock(&tree->lock);
>  	if (writeback_result) {
>  		zswap_reject_reclaim_fail++;
> -		zswap_lru_putback(&entry->pool->list_lru, entry);
>  		ret = LRU_RETRY;
>
>  		/*
> @@ -903,27 +876,10 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>  		 */
>  		if (writeback_result == -EEXIST && encountered_page_in_swapcache)
>  			*encountered_page_in_swapcache = true;
> -
> -		goto put_unlock;
> +	} else {
> +		zswap_written_back_pages++;
>  	}
> -	zswap_written_back_pages++;
> -
> -	if (entry->objcg)
> -		count_objcg_event(entry->objcg, ZSWPWB);
> -
> -	count_vm_event(ZSWPWB);
> -	/*
> -	 * Writeback started successfully, the page now belongs to the
> -	 * swapcache. Drop the entry from zswap - unless invalidate already
> -	 * took it out while we had the tree->lock released for IO.
> -	 */
> -	zswap_invalidate_entry(tree, entry);
>
> -put_unlock:
> -	/* Drop local reference */
> -	zswap_entry_put(entry);
> -unlock:
> -	spin_unlock(&tree->lock);
>  	spin_lock(lock);
>  	return ret;
>  }
> @@ -1408,9 +1364,9 @@ static void __zswap_load(struct zswap_entry *entry, struct page *page)
>   * freed.
>   */
>  static int zswap_writeback_entry(struct zswap_entry *entry,
> -				 struct zswap_tree *tree)
> +				 swp_entry_t swpentry)
>  {
> -	swp_entry_t swpentry = entry->swpentry;
> +	struct zswap_tree *tree;
>  	struct folio *folio;
>  	struct mempolicy *mpol;
>  	bool folio_was_allocated;
> @@ -1442,18 +1398,31 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  	 * backs (our zswap_entry reference doesn't prevent that), to
>  	 * avoid overwriting a new swap folio with old compressed data.
>  	 */
> +	tree = swap_zswap_tree(swpentry);
>  	spin_lock(&tree->lock);
> -	if (zswap_rb_search(&tree->rbroot, swp_offset(entry->swpentry)) != entry) {
> +	if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
>  		spin_unlock(&tree->lock);
>  		delete_from_swap_cache(folio);
>  		folio_unlock(folio);
>  		folio_put(folio);
>  		return -ENOMEM;
>  	}
> +
> +	/* Safe to deref entry after the entry is verified above. */
> +	zswap_entry_get(entry);
>  	spin_unlock(&tree->lock);
>
>  	__zswap_load(entry, &folio->page);
>
> +	count_vm_event(ZSWPWB);
> +	if (entry->objcg)
> +		count_objcg_event(entry->objcg, ZSWPWB);
> +
> +	spin_lock(&tree->lock);
> +	zswap_invalidate_entry(tree, entry);
> +	zswap_entry_put(entry);
> +	spin_unlock(&tree->lock);
> +
>  	/* folio is up to date */
>  	folio_mark_uptodate(folio);
>
> --
> 2.40.1
>
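Re my nit above about the comment over the folio_was_allocated check in
zswap_writeback_entry(): something along these lines is what I had in
mind -- purely a strawman, the exact wording (and whether it's worth it
at all) is up to you:

	/*
	 * Found an existing folio in the swap cache: writeback, swapin,
	 * or a concurrent reclaimer got to this swap entry first. Now
	 * that shrink_memcg_cb() only rotates the entry on the LRU
	 * instead of isolating it, this check is also what serializes
	 * duplicate reclaim attempts on the same entry.
	 */
	if (!folio_was_allocated) {
		folio_put(folio);
		return -EEXIST;
	}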
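One more observation, mostly for future readers (not a blocker): in the
new zswap_writeback_entry() all the entry/tree work is deliberately
grouped before the folio is marked uptodate and, further down (outside
this hunk), unlocked for writeback:

	spin_lock(&tree->lock);
	zswap_invalidate_entry(tree, entry);
	zswap_entry_put(entry);           /* last use of entry */
	spin_unlock(&tree->lock);         /* last use of tree */

	/* folio is up to date */
	folio_mark_uptodate(folio);

IIUC that's exactly the "we depend on this to hold off swapoff" point in
the changelog: once the folio is unlocked, swapoff can proceed and the
tree may be freed, so nothing after that may touch entry or tree. Might
be worth a one-line comment there too, but no strong feelings.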