On Fri, Jan 26, 2024 at 7:31 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Fri, Jan 26, 2024 at 08:30:15AM +0000, chengming.zhou@xxxxxxxxx wrote:
> > From: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx>
> >
> > LRU writeback has race problem with swapoff, as spotted by Yosry[1]:
> >
> >    CPU1                           CPU2
> >    shrink_memcg_cb                swap_off
> >      list_lru_isolate               zswap_invalidate
> >                                     zswap_swapoff
> >                                       kfree(tree)
> >      // UAF
> >      spin_lock(&tree->lock)
> >
> > The problem is that the entry in lru list can't protect the tree from
> > being swapoff and freed, and the entry also can be invalidated and freed
> > concurrently after we unlock the lru lock.
> >
> > We can fix it by moving the swap cache allocation ahead before
> > referencing the tree, then check invalidate race with tree lock,
> > only after that we can safely deref the entry. Note we couldn't
> > deref entry or tree anymore after we unlock the folio, since we
> > depend on this to hold on swapoff.
>
> This is a great simplification on top of being a bug fix.
>
> > So this patch moves all tree and entry usage to zswap_writeback_entry(),
> > we only use the copied swpentry on the stack to allocate swap cache
> > and return with folio locked, after which we can reference the tree.
> > Then check invalidate race with tree lock, the following things is
> > much the same like zswap_load().
> >
> > Since we can't deref the entry after zswap_writeback_entry(), we
> > can't use zswap_lru_putback() anymore, instead we rotate the entry
> > in the LRU list so concurrent reclaimers have little chance to see it.
> > Or it will be deleted from LRU list if writeback success.
> >
> > Another confusing part to me is the update of memcg nr_zswap_protected
> > in zswap_lru_putback(). I'm not sure why it's needed here since
> > if we raced with swapin, memcg nr_zswap_protected has already been
> > updated in zswap_folio_swapin(). So not include this part for now.
>
> Good observation.
>
> Technically, it could also fail on -ENOMEM, but in practice these size
> allocations don't fail, especially since the shrinker runs in
> PF_MEMALLOC context. The shrink_worker might be affected, since it
> doesn't. But the common case is -EEXIST, which indeed double counts.

Yup. At the time, I was thinking more along the lines of what mechanisms
should trigger a protection size increase. "swapin" and "LRU list
rotations" were two different mechanisms in my head. I was aware that
there could be double counting, but deemed it OK, as the cost of
over-shrinking (an increase in swapins) was fairly serious, and we have
a fairly aggressive decaying strategy if we protect too much. But yes,
I doubt it mattered that much in hindsight :)

And the cases where it is double counted far outnumber the cases where
it is not, so I'm fine with removing it here.

> To make it "correct", you'd have to grab an objcg reference with the
> LRU lock, and also re-order the objcg put on entry freeing after the
> LRU del. This is probably not worth doing. But it could use a comment.
>
> I was going to ask if you could reorder objcg uncharging after LRU
> deletion to make it more robust for future changes in that direction.
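
FWIW, the way I think about the ordering: the per-memcg LRU is only
reachable through the objcg pointer, so the unlink has to happen while
that reference is still pinned. A toy userspace model of just that
constraint (struct owner / struct item / free_item() are made-up names,
obviously not the zswap code) looks roughly like this:

/*
 * Toy userspace model (not the zswap code; all names are made up) of
 * the ordering constraint: the per-owner list is reached through the
 * owner pointer, so the owner reference may only be dropped after the
 * item has been unlinked from that list.
 */
#include <stdio.h>
#include <stdlib.h>

struct owner {				/* stands in for the objcg */
	int refcount;
	int nr_items;			/* stands in for the per-memcg list_lru state */
};

struct item {				/* stands in for a zswap entry */
	struct owner *owner;
};

static void owner_put(struct owner *o)
{
	if (--o->refcount == 0)
		free(o);
}

static void list_del_item(struct item *it)
{
	/* Needs a live owner to locate and update the right list. */
	it->owner->nr_items--;
}

static void free_item(struct item *it)
{
	/*
	 * Unlink first, drop the owner reference last.  Doing the put
	 * before the del would let the owner be freed while the del
	 * still needs to dereference it.
	 */
	list_del_item(it);
	owner_put(it->owner);
	free(it);
}

int main(void)
{
	struct owner *o = calloc(1, sizeof(*o));
	struct item *it = calloc(1, sizeof(*it));

	o->refcount = 1;		/* held on behalf of the item */
	o->nr_items = 1;
	it->owner = o;

	free_item(it);
	printf("done\n");
	return 0;
}

Same shape as the put-before-del issue you point out below.
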
> However, staring at this, I notice this is a second UAF bug:
>
> 	if (entry->objcg) {
> 		obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
> 		obj_cgroup_put(entry->objcg);
> 	}
> 	if (!entry->length)
> 		atomic_dec(&zswap_same_filled_pages);
> 	else {
> 		zswap_lru_del(&entry->pool->list_lru, entry);
>
> zswap_lru_del() uses entry->objcg to determine the list_lru memcg, but
> the put may have killed it. I'll send a separate patch on top.
>
> > @@ -860,40 +839,34 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> >  {
> >  	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
> >  	bool *encountered_page_in_swapcache = (bool *)arg;
> > -	struct zswap_tree *tree;
> > -	pgoff_t swpoffset;
> > +	swp_entry_t swpentry;
> >  	enum lru_status ret = LRU_REMOVED_RETRY;
> >  	int writeback_result;
> >  
> > +	/*
> > +	 * First rotate to the tail of lru list before unlocking lru lock,
> > +	 * so the concurrent reclaimers have little chance to see it.
> > +	 * It will be deleted from the lru list if writeback success.
> > +	 */
> > +	list_move_tail(item, &l->list);
>
> We don't hold a reference to the object, so there could also be an
> invalidation waiting on the LRU lock, which will free the entry even
> when writeback fails.
>
> It would also be good to expand on the motivation, because it's not
> clear WHY you'd want to hide it from other reclaimers.
>
> Lastly, maybe mention the story around temporary failures? Most
> shrinkers have a lock inversion pattern (object lock -> LRU lock for
> linking versus LRU lock -> object trylock during reclaim) that can
> fail and require the same object be tried again before advancing.
>
> How about this?
>
> 	/*
> 	 * Rotate the entry to the tail before unlocking the LRU,
> 	 * so that in case of an invalidation race concurrent
> 	 * reclaimers don't waste their time on it.
> 	 *
> 	 * If writeback succeeds, or failure is due to the entry
> 	 * being invalidated by the swap subsystem, the invalidation
> 	 * will unlink and free it.
> 	 *
> 	 * Temporary failures, where the same entry should be tried
> 	 * again immediately, almost never happen for this shrinker.
> 	 * We don't do any trylocking; -ENOMEM comes closest,
> 	 * but that's extremely rare and doesn't happen spuriously
> 	 * either. Don't bother distinguishing this case.
> 	 *
> 	 * But since they do exist in theory, the entry cannot just
> 	 * be unlinked, or we could leak it. Hence, rotate.
> 	 */
>
> Otherwise, looks great to me.
>
> Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
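
For completeness, here is a rough userspace sketch of the pattern the
patch relies on (toy_tree / toy_entry / toy_writeback() are made-up
names, not the actual kernel code): across the unlocked window the
writeback path only trusts the key it copied to the stack, and it
re-validates under the tree lock before dereferencing anything, much
like zswap_load():

/*
 * Rough userspace sketch (made-up names, not the kernel code) of the
 * "copy the key, drop the lock, re-validate under the tree lock"
 * pattern described in the changelog.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NSLOTS	16

struct toy_entry {
	unsigned long key;		/* stands in for the stack-copied swpentry */
	int payload;
};

struct toy_tree {
	pthread_mutex_t lock;		/* stands in for tree->lock */
	struct toy_entry *slot[NSLOTS];	/* stands in for the per-type tree */
};

/*
 * Called with only the copied key; the entry itself may have been
 * invalidated and freed while we were doing the slow work unlocked.
 */
static int toy_writeback(struct toy_tree *tree, unsigned long key)
{
	struct toy_entry *entry;

	/* ...slow, sleeping work (swapcache allocation etc.) happens here... */

	pthread_mutex_lock(&tree->lock);
	entry = tree->slot[key % NSLOTS];
	if (!entry || entry->key != key) {
		/* Raced with invalidation: bail out, as zswap_load() would. */
		pthread_mutex_unlock(&tree->lock);
		return -1;
	}
	printf("writing back payload %d\n", entry->payload);
	tree->slot[key % NSLOTS] = NULL;	/* unlink from the tree */
	pthread_mutex_unlock(&tree->lock);
	free(entry);
	return 0;
}

int main(void)
{
	struct toy_tree tree = { .lock = PTHREAD_MUTEX_INITIALIZER };
	struct toy_entry *e = malloc(sizeof(*e));

	e->key = 3;
	e->payload = 42;
	tree.slot[3] = e;

	toy_writeback(&tree, 3);	/* finds the entry and writes it back */
	toy_writeback(&tree, 3);	/* sees it's gone and bails out */
	return 0;
}

The real code additionally depends on the locked folio to hold off
swapoff, which this sketch doesn't try to model.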