> > > > > However, while I have your attention on the swapoff path, there's
> > > > > a slightly unrelated problem that I think might be there, but I
> > > > > am not sure.
> > > > >
> > > > > It looks to me like swapoff can race with writeback, and there
> > > > > may be a chance of UAF for the zswap tree. For example, if
> > > > > zswap_swapoff() races with shrink_memcg_cb(), I feel like we may
> > > > > free the tree as it is being used, e.g. if
> > > > > zswap_swapoff()->kfree(tree) happens right before
> > > > > shrink_memcg_cb()->list_lru_isolate(l, item).
> > > > >
> > > > > Please tell me that I am being paranoid and that there is some
> > > > > protection against zswap writeback racing with swapoff. It feels
> > > > > like we are very careful with zswap entries refcounting, but not
> > > > > with the zswap tree itself.
> > > >
> > > > Hm, I don't see how.
> > > >
> > > > Writeback operates on entries from the LRU. By the time
> > > > zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate()
> > > > will have emptied out the LRU and tree.
> > > >
> > > > Writeback could have gotten a refcount to the entry and dropped
> > > > the tree->lock. But then it does __read_swap_cache_async(), and
> > > > while holding the page lock checks the tree under lock once more;
> > > > if that finds the entry valid, it means try_to_unuse() hasn't
> > > > started on this page yet, and would be held up by the page
> > > > lock/writeback state.
> > >
> > > Consider the following race:
> > >
> > > CPU 1                          CPU 2
> > > # In shrink_memcg_cb()         # In swap_off
> > > list_lru_isolate()
> > >                                zswap_invalidate()
> > >                                ..
> > >                                zswap_swapoff() -> kfree(tree)
> > > spin_lock(&tree->lock);
> > >
> > > Isn't this a UAF or am I missing something here?
> >
> > Oof. You're right, it looks like there is a bug. Digging through the
> > history, I think this is actually quite old: the original backend
> > shrinkers would pluck something off their LRU, drop all locks, then
> > try to acquire tree->lock. There is no protection against swapoff.
> >
> > The lock that is supposed to protect this is the LRU lock. That's
> > where reclaim and invalidation should synchronize. But it's not
> > right:
> >
> > 1. We drop the LRU lock before acquiring the tree lock. We should
> >    instead trylock the tree while still holding the LRU lock to make
> >    sure the tree is safe against swapoff.
> >
> > 2. zswap_invalidate() acquires the LRU lock when refcount hits 0.
> >    But it must always cycle the LRU lock before freeing the tree for
> >    that synchronization to work.
> >
> > Once we're holding a refcount to the entry, it's safe to drop all
> > locks for the next step, because we'll then work against the
> > swapcache and entry: __read_swap_cache_async() will try to pin and
> > lock whatever swap entry is at that type+offset. This also pins the
> > type's current tree. HOWEVER, if swapoff + swapon raced, this could
> > be a different tree than what we had in @tree, so
> >
> > 3. we shouldn't pass @tree to zswap_writeback_entry(). It needs to
> >    look up zswap_trees[] again after __read_swap_cache_async()
> >    succeeded to validate the entry.
> >
> > Once it succeeded, we can validate the entry. The entry is valid due
> > to our refcount. The zswap_trees[type] is valid due to the cache
> > pin.
> >
> > However, if validation failed and we have a non-zero
> > writeback_result, there is one last bug:
> >
> > 4. the original entry's tree is no longer valid for the entry put.
> > The current scheme handles invalidation fine (which is good because
> > that's quite common). But it's fundamentally unsynchronized against
> > swapoff (which has probably gone undetected because that's rare).
> >
> > I can't think of an immediate solution to this, but I wanted to put
> > my analysis out for comments.
>
> Thanks for the great analysis, I missed the swapoff/swapon race
> myself :)
>
> The first solution that came to mind for me was refcounting the zswap
> tree with RCU and a percpu-refcount, similar to how cgroup refs are
> handled (init in zswap_swapon() and kill in zswap_swapoff()). I think
> the percpu-refcount may be overkill in terms of memory usage though.
> I think we can still do our own refcounting with RCU, but it may be
> more complicated.

FWIW, I was able to reproduce the problem in a vm with the following
kernel diff:

diff --git a/mm/zswap.c b/mm/zswap.c
index 78df16d307aa8..6580a4be52a18 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -880,6 +880,9 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 	 */
 	spin_unlock(lock);
 
+	pr_info("Sleeping in shrink_memcg_cb() before spin_lock(&tree->lock)\n");
+	schedule_timeout_uninterruptible(msecs_to_jiffies(10 * 1000));
+
 	/* Check for invalidate() race */
 	spin_lock(&tree->lock);
 	if (entry != zswap_rb_search(&tree->rbroot, swpoffset))

This basically expands the race window to 10 seconds. I have a
reproducer script attached (zswap_wb_swapoff_race.sh) that utilizes
the zswap shrinker (which makes this much easier to reproduce btw).
The steps are as follows:

- Compile the kernel with the above diff, and with both ZRAM & KASAN
  enabled.
- In one terminal, run zswap_wb_swapoff_race.sh.
- In a different terminal, once the "Sleeping in shrink_memcg_cb()"
  message is logged, run "swapoff /dev/zram0".
- In the first terminal, once the 10 seconds elapse, I get a UAF BUG
  from KASAN (log attached as well).
#!/bin/bash

TOTAL_MB=100
TOTAL_BYTES=$((TOTAL_MB << 20))

ROOT_CG=/sys/fs/cgroup
ZRAM_DEV=/dev/zram0
TMPFS=./tmpfs

A_ASCII=$(printf '%d' "'A")

function cleanup() {
    rm -rf $TMPFS/*
    umount $TMPFS
    rmdir $TMPFS
    if swapon -s | grep -q $ZRAM_DEV; then
        swapoff $ZRAM_DEV
    fi
    echo 1 > /sys/block/zram0/reset
    rmdir $ROOT_CG/test
}
trap cleanup EXIT

echo 1 > /sys/module/zswap/parameters/shrinker_enabled

# Initialize tmpfs
mkdir $TMPFS
mount -t tmpfs none $TMPFS

# Initialize zram
echo $((TOTAL_MB*2))m > /sys/block/zram0/disksize
mkswap -L zram0 $ZRAM_DEV
swapon $ZRAM_DEV

# Initialize cgroup
echo "+memory" > $ROOT_CG/cgroup.subtree_control
mkdir $ROOT_CG/test

# Create a test tmpfs file containing the alphabet on repeat
(
    echo 0 > $ROOT_CG/test/cgroup.procs
    s=$(printf '%s' {A..Z})
    yes $s | dd iflag=fullblock of=$TMPFS/test bs=1M count=$TOTAL_MB status=none
)

# Reclaim everything (invoking the zswap shrinker)
echo $TOTAL_BYTES > $ROOT_CG/test/memory.reclaim
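To make the percpu-refcount direction mentioned above a bit more
concrete, here is a rough sketch of what it could look like. This is
untested, and the names (the "ref" member, zswap_tree_release(), and
the helper functions) are made up for illustration, not actual
mm/zswap.c code:

#include <linux/percpu-refcount.h>
#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct zswap_tree {
	struct rb_root rbroot;
	spinlock_t lock;
	struct percpu_ref ref;	/* hypothetical: pins the tree */
};

/* Runs once the swapoff ref and all writeback refs have been put. */
static void zswap_tree_release(struct percpu_ref *ref)
{
	struct zswap_tree *tree = container_of(ref, struct zswap_tree, ref);

	kfree(tree);
}

/* zswap_swapon(): hold the initial reference for the device. */
static int zswap_tree_ref_init(struct zswap_tree *tree)
{
	return percpu_ref_init(&tree->ref, zswap_tree_release, 0,
			       GFP_KERNEL);
}

/*
 * zswap_swapoff(): drop the initial reference instead of calling
 * kfree(tree) directly. The tree is freed by zswap_tree_release()
 * once any in-flight writeback puts its reference.
 */
static void zswap_tree_ref_kill(struct zswap_tree *tree)
{
	percpu_ref_kill(&tree->ref);
}

/*
 * Writeback (e.g. shrink_memcg_cb()): pin the tree before dropping
 * the LRU lock and taking tree->lock.
 */
static bool zswap_tree_tryget(struct zswap_tree *tree)
{
	return percpu_ref_tryget(&tree->ref);
}

Note that the zswap_trees[type] lookup itself would still need
something like RCU so the tryget runs against a tree that cannot be
freed mid-call, and each percpu_ref carries per-CPU counters, which is
the memory overhead concern mentioned above.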
Attachment: uaf_log (KASAN UAF report)
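For reference, here is a rough sketch of how points 1 and 3 from the
analysis quoted above could translate into mm/zswap.c. This is
approximate and untested: locking transitions and error handling are
simplified, and passing the swp_entry_t instead of @tree to
zswap_writeback_entry() is the point 3 change:

static enum lru_status shrink_memcg_cb(struct list_head *item,
				       struct list_lru_one *l,
				       spinlock_t *lock, void *arg)
{
	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
	struct zswap_tree *tree = zswap_trees[swp_type(entry->swpentry)];

	/*
	 * Point 1: take tree->lock while still holding the LRU lock.
	 * Swapoff must cycle the LRU lock before freeing the tree
	 * (point 2), so a tree reachable through an entry still on the
	 * LRU cannot be freed while we hold the LRU lock here.
	 */
	if (!spin_trylock(&tree->lock))
		return LRU_SKIP;

	/* Check for invalidate() race */
	if (entry != zswap_rb_search(&tree->rbroot,
				     swp_offset(entry->swpentry))) {
		spin_unlock(&tree->lock);
		return LRU_SKIP;
	}

	list_lru_isolate(l, item);
	zswap_entry_get(entry);		/* entry now pinned by its refcount */
	spin_unlock(&tree->lock);
	spin_unlock(lock);

	/*
	 * Point 3: pass the swp_entry_t, not @tree. After
	 * __read_swap_cache_async() pins the swap slot,
	 * zswap_writeback_entry() must look up zswap_trees[] again,
	 * since a racing swapoff + swapon may have installed a new
	 * tree for this type.
	 */
	zswap_writeback_entry(entry, entry->swpentry);

	/* ... entry put against the re-looked-up tree (point 4) ... */

	spin_lock(lock);	/* the LRU walker expects the lock held on return */
	return LRU_REMOVED_RETRY;
}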