On Tue, Jan 23, 2024 at 12:12 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Tue, Jan 23, 2024 at 07:54:49AM -0800, Yosry Ahmed wrote:
> > On Tue, Jan 23, 2024 at 7:38 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > >
> > > On Mon, Jan 22, 2024 at 12:39:16PM -0800, Yosry Ahmed wrote:
> > > > On Mon, Jan 22, 2024 at 12:19 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > > > >
> > > > > On Sat, Jan 20, 2024 at 02:40:07AM +0000, Yosry Ahmed wrote:
> > > > > > During swapoff, try_to_unuse() makes sure that zswap_invalidate() is
> > > > > > called for all swap entries before zswap_swapoff() is called. This means
> > > > > > that all zswap entries should already be removed from the tree. Simplify
> > > > > > zswap_swapoff() by removing the tree cleanup loop, and leaving an
> > > > > > assertion in its place.
> > > > > >
> > > > > > Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > > > >
> > > > > Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> > > > >
> > > > > That's a great simplification.
> > > > >
> > > > > Removing the tree->lock made me double take, but at this point the
> > > > > swapfile and its cache should be fully dead and I don't see how any of
> > > > > the zswap operations that take tree->lock could race at this point.
> > > >
> > > > It took me a while staring at the code to realize this loop is pointless.
> > > >
> > > > However, while I have your attention on the swapoff path, there's a
> > > > somewhat unrelated problem that I think might be there, but I am not
> > > > sure.
> > > >
> > > > It looks to me like swapoff can race with writeback, and there may be
> > > > a chance of UAF for the zswap tree. For example, if zswap_swapoff()
> > > > races with shrink_memcg_cb(), I feel like we may free the tree as it
> > > > is being used. For example, if zswap_swapoff()->kfree(tree) happens
> > > > right before shrink_memcg_cb()->list_lru_isolate(l, item).
> > > >
> > > > Please tell me that I am being paranoid and that there is some
> > > > protection against zswap writeback racing with swapoff. It feels like
> > > > we are very careful with zswap entry refcounting, but not with the
> > > > zswap tree itself.
> > >
> > > Hm, I don't see how.
> > >
> > > Writeback operates on entries from the LRU. By the time
> > > zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate() will
> > > have emptied out the LRU and tree.
> > >
> > > Writeback could have gotten a refcount to the entry and dropped the
> > > tree->lock. But then it does __read_swap_cache_async(), and while
> > > holding the page lock checks the tree under lock once more; if that
> > > finds the entry valid, it means try_to_unuse() hasn't started on this
> > > page yet, and would be held up by the page lock/writeback state.
> >
> > Consider the following race:
> >
> > CPU 1                                  CPU 2
> > # In shrink_memcg_cb()                 # In swapoff
> > list_lru_isolate()
> >                                        zswap_invalidate()
> >                                        ...
> >                                        zswap_swapoff() -> kfree(tree)
> > spin_lock(&tree->lock);
> >
> > Isn't this a UAF or am I missing something here?
>
> Oof. You're right, it looks like there is a bug. Digging through the
> history, I think this is actually quite old: the original backend
> shrinkers would pluck something off their LRU, drop all locks, then
> try to acquire tree->lock. There is no protection against swapoff.
>
> The lock that is supposed to protect this is the LRU lock. That's
> where reclaim and invalidation should synchronize. But it's not right:
>
> 1. We drop the LRU lock before acquiring the tree lock. We should
>    instead trylock the tree while still holding the LRU lock to make
>    sure the tree is safe against swapoff.
>
> 2. zswap_invalidate() acquires the LRU lock when the refcount hits 0.
>    But it must always cycle the LRU lock before freeing the tree for
>    that synchronization to work.
>
> Once we're holding a refcount to the entry, it's safe to drop all
> locks for the next step because we'll then work against the swapcache
> and entry: __read_swap_cache_async() will try to pin and lock whatever
> swap entry is at that type+offset. This also pins the type's current
> tree. HOWEVER, if swapoff + swapon raced, this could be a different
> tree than what we had in @tree, so
>
> 3. we shouldn't pass @tree to zswap_writeback_entry(). It needs to
>    look up zswap_trees[] again after __read_swap_cache_async()
>    succeeded, in order to validate the entry.
>
> Once that succeeded, we can validate the entry. The entry is valid due
> to our refcount. zswap_trees[type] is valid due to the cache pin.
>
> However, if validation failed and we have a non-zero writeback_result,
> there is one last bug:
>
> 4. the original entry's tree is no longer valid for the entry put.
>
> The current scheme handles invalidation fine (which is good because
> that's quite common). But it's fundamentally unsynchronized against
> swapoff (which has probably gone undetected because that's rare).
>
> I can't think of an immediate solution to this, but I wanted to put my
> analysis out for comments.

Thanks for the great analysis, I missed the swapoff/swapon race myself :)
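
Just to make sure I understand points 1 and 2, the shrinker side would
look something like the following? Completely untested sketch, and I am
guessing at the details (where the tree lookup happens and the LRU_SKIP
return are my assumptions):

	static enum lru_status shrink_memcg_cb(struct list_head *item,
					       struct list_lru_one *l,
					       spinlock_t *lock, void *arg)
	{
		struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
		struct zswap_tree *tree = zswap_trees[swp_type(entry->swpentry)];

		/*
		 * Per point 1: trylock the tree while still holding the LRU
		 * lock. IIUC, as long as the entry is on the LRU and we hold
		 * the LRU lock, invalidation of this entry cannot complete
		 * and the tree cannot be freed (given point 2). Trylock
		 * avoids deadlocking against zswap_invalidate(), which takes
		 * the LRU lock while already holding tree->lock.
		 */
		if (!spin_trylock(&tree->lock))
			return LRU_SKIP;

		list_lru_isolate(l, item);
		zswap_entry_get(entry);		/* entry now pinned by refcount */
		spin_unlock(&tree->lock);

		/* ... drop the LRU lock and proceed with writeback as before ... */
		return LRU_REMOVED;
	}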
The first solution that came to mind for me was refcounting the zswap
tree using a percpu-refcount under RCU, similar to how cgroup refs are
handled (init in zswap_swapon() and kill in zswap_swapoff()). I think
the percpu-refcount may be overkill in terms of memory usage though. I
think we can still do our own refcounting with RCU, but it may be more
complicated.
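
For concreteness, this is roughly what I have in mind -- completely
untested, and the new ref member in struct zswap_tree, the release
callback name, and the error handling are just my assumptions:

	static void zswap_tree_release(struct percpu_ref *ref)
	{
		struct zswap_tree *tree = container_of(ref, struct zswap_tree, ref);

		/* Runs once all refs are dropped. */
		percpu_ref_exit(&tree->ref);
		kfree(tree);
	}

	int zswap_swapon(int type)
	{
		struct zswap_tree *tree;

		tree = kzalloc(sizeof(*tree), GFP_KERNEL);
		if (!tree)
			return -ENOMEM;

		tree->rbroot = RB_ROOT;
		spin_lock_init(&tree->lock);
		/* Initial ref held by the swapfile, dropped in zswap_swapoff(). */
		if (percpu_ref_init(&tree->ref, zswap_tree_release, 0, GFP_KERNEL)) {
			kfree(tree);
			return -ENOMEM;
		}
		zswap_trees[type] = tree;
		return 0;
	}

	void zswap_swapoff(int type)
	{
		struct zswap_tree *tree = zswap_trees[type];

		if (!tree)
			return;

		/* try_to_unuse() -> zswap_invalidate() should have emptied it. */
		WARN_ON_ONCE(!RB_EMPTY_ROOT(&tree->rbroot));
		zswap_trees[type] = NULL;
		/* Freeing is deferred until concurrent writeback drops its refs. */
		percpu_ref_kill(&tree->ref);
	}

Writeback would then pin the tree before touching it, instead of
trusting the @tree passed in from the shrinker (your point 3):

	rcu_read_lock();
	tree = zswap_trees[type];
	if (!tree || !percpu_ref_tryget(&tree->ref)) {
		/* swapoff already killed this tree */
		rcu_read_unlock();
		return -ENOENT;
	}
	rcu_read_unlock();

	/* ... tree is stable here; take tree->lock, validate the entry ... */

	percpu_ref_put(&tree->ref);

With that, zswap_swapoff() doesn't have to wait for writeback at all;
the tree is freed whenever the last ref is dropped.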