Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff()

Johannes Weiner <hannes@xxxxxxxxxxx> · Tue, 23 Jan 2024 15:12:34 -0500

On Tue, Jan 23, 2024 at 07:54:49AM -0800, Yosry Ahmed wrote:
> On Tue, Jan 23, 2024 at 7:38 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> >
> > On Mon, Jan 22, 2024 at 12:39:16PM -0800, Yosry Ahmed wrote:
> > > On Mon, Jan 22, 2024 at 12:19 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > > >
> > > > On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote:
> > > > > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is
> > > > > called for all swap entries before zswap_swapoff() is called. This means
> > > > > that all zswap entries should already be removed from the tree. Simplify
> > > > > zswap_swapoff() by removing the tree cleanup loop, and leaving an
> > > > > assertion in its place.
> > > > >
> > > > > Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > > >
> > > > Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> > > >
> > > > That's a great simplification.
> > > >
> > > > Removing the tree->lock made me double take, but at this point the
> > > > swapfile and its cache should be fully dead and I don't see how any of
> > > > the zswap operations that take tree->lock could race at this point.
> > >
> > > It took me a while staring at the code to realize this loop is pointless.
> > >
> > > However, while I have your attention on the swapoff path, there's a
> > > slightly irrelevant problem that I think might be there, but I am not
> > > sure.
> > >
> > > It looks to me like swapoff can race with writeback, and there may be
> > > a chance of UAF for the zswap tree. For example, if zswap_swapoff()
> > > races with shrink_memcg_cb(), I feel like we may free the tree as it
> > > is being used. For example if zswap_swapoff()->kfree(tree) happen
> > > right before shrink_memcg_cb()->list_lru_isolate(l, item).
> > >
> > > Please tell me that I am being paranoid and that there is some
> > > protection against zswap writeback racing with swapoff. It feels like
> > > we are very careful with zswap entries refcounting, but not with the
> > > zswap tree itself.
> >
> > Hm, I don't see how.
> >
> > Writeback operates on entries from the LRU. By the time
> > zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate() should
> > will have emptied out the LRU and tree.
> >
> > Writeback could have gotten a refcount to the entry and dropped the
> > tree->lock. But then it does __read_swap_cache_async(), and while
> > holding the page lock checks the tree under lock once more; if that
> > finds the entry valid, it means try_to_unuse() hasn't started on this
> > page yet, and would be held up by the page lock/writeback state.
> 
> Consider the following race:
> 
> CPU 1                                 CPU 2
> # In shrink_memcg_cb()     # In swap_off
> list_lru_isolate()
>                                             zswap_invalidate()
>                                             ..
>                                             zswap_swapoff() -> kfree(tree)
> spin_lock(&tree->lock);
> 
> Isn't this a UAF or am I missing something here?

Oof. You're right, it looks like there is a bug. Digging through the
history, I think this is actually quite old: the original backend
shrinkers would pluck something off their LRU, drop all locks, then
try to acquire tree->lock. There is no protection against swapoff.

The lock that is supposed to protect this is the LRU lock. That's
where reclaim and invalidation should synchronize. But it's not right:

1. We drop the LRU lock before acquiring the tree lock. We should
   instead trylock the tree while still holding the LRU lock to make
   sure the tree is safe against swapoff.

2. zswap_invalidate() acquires the LRU lock when refcount hits 0. But
   it must always cycle the LRU lock before freeing the tree for that
   synchronization to work.

Once we're holding a refcount to the entry, it's safe to drop all
locks for the next step because we'll then work against the swapcache
and entry: __read_swap_cache_async() will try to pin and lock whatever
swap entry is at that type+offset. This also pins the type's current
tree. HOWEVER, if swapoff + swapon raced, this could be a different
tree than what we had in @tree, so

3. we shouldn't pass @tree to zswap_writeback_entry(). It needs to
   look up zswap_trees[] again after __read_swap_cache_async()
   succeeded to validate the entry.

Once it succeeded, we can validate the entry. The entry is valid due
to our refcount. The zswap_trees[type] is valid due to the cache pin.

However, if validation failed and we have a non-zero writeback_result,
there is one last bug:

4. the original entry's tree is no longer valid for the entry put.

The current scheme handles invalidation fine (which is good because
that's quite common). But it's fundamentally unsynchronized against
swapoff (which has probably gone undetected because that's rare).

I can't think of an immediate solution to this, but I wanted to put my
analysis out for comments.