Hi Yosry,

On Thu, Jan 25, 2024 at 12:58 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Thu, Jan 25, 2024 at 10:55 AM Chris Li <chrisl@xxxxxxxxxx> wrote:
> >
> > Hi Yosry,
> >
> > On Wed, Jan 24, 2024 at 11:59 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > >
> > > On Wed, Jan 24, 2024 at 9:29 PM Chris Li <chriscli@xxxxxxxxxx> wrote:
> > > >
> > > > Hi Yosry,
> > > >
> > > > On Tue, Jan 23, 2024 at 10:58 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > > >
> > > > > >
> > > > > > Thanks for the great analysis, I missed the swapoff/swapon race myself :)
> > > > > >
> > > > > > The first solution that came to mind for me was refcounting the zswap
> > > > > > tree with RCU with percpu-refcount, similar to how cgroup refs are
> > > > > > handled (init in zswap_swapon() and kill in zswap_swapoff()). I think
> > > > > > the percpu-refcount may be an overkill in terms of memory usage
> > > > > > though. I think we can still do our own refcounting with RCU, but it
> > > > > > may be more complicated.
> > > > > >
> > > > > FWIW, I was able to reproduce the problem in a vm with the following
> > > > > kernel diff:
> > > >
> > > > Thanks for the great find.
> > > >
> > > > I was worry about the usage after free situation in this email:
> > > >
> > > > https://lore.kernel.org/lkml/CAF8kJuOvOmn7wmKxoqpqSEk4gk63NtQG1Wc+Q0e9FZ9OFiUG6g@xxxxxxxxxxxxxx/
> > > >
> > > > Glad you are able to find a reproducible case. That is one of the
> > > > reasons I change the free to invalidate entries in my xarray patch.
> > > >
> > > > I think the swap_off code should remove the entry from the tree, just
> > > > wait for each zswap entry to drop to zero. Then free it.
> > >
> > > This doesn't really help. The swapoff code is already removing all the
> > > entries from the trees before zswap_swapoff() is called through
> > > zswap_invalidate(). The race I described occurs because the writeback
> > > code is accessing the entries through the LRU, not the tree. The
> >
> > Why?
> > Entry not in the tree is fine. What you describe is that, swap_off
> > code will not see the entry because it is already not in the tree.
> > The last one holding the reference count would free it.
> >
> > > writeback code could have isolated a zswap entry from the LRU before
> > > swapoff, then tried to access it after swapoff. Although the zswap
> >
> > The writeback should have a reference count to the zswap entry it
> > holds. The entry will not be free until the LRU is done and drop the
> > reference count to zero.
> >
> > > entry itself is referenced and safe to use, accessing the tree to grab
> > > the tree lock and check if the entry is still in the tree is the
> > > problem.
> >
> > The swap off should wait until all the LRU list from that tree has
> > been consumed before destroying the tree.
> > In swap off, it walks all the process MM anyway, walking all the memcg
> > and finding all the zswap entries in zswap LRU should solve that
> > problem.
>
> At that point, the entry is isolated from the zswap LRU list as well.
> So even if swap off iterates the zswap LRUs, it cannot find it to wait

It just means that we need to defer removing the entry from the LRU,
and only remove it once writeback has made it past the critical steps.

> for it. Take a closer look at the race condition I described. The

I took a closer look at the sequence Chengming described; it has the
element of delaying the removal of the entry from the LRU as well.
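
For concreteness, the sketch below is roughly how I picture the tree
refcount alternative from the top of the thread: a percpu_ref that is
initialized in zswap_swapon(), killed in zswap_swapoff(), and pinned
by writeback before it touches tree->lock. It is only an illustration
of the extra machinery involved, not code from any patch, and the
struct fields and helper names are made up.

/*
 * Sketch only, not from any actual patch: pin the per-swapfile zswap
 * tree with a percpu_ref so writeback can safely take tree->lock while
 * racing with swapoff. Field and helper names here are made up.
 */
#include <linux/percpu-refcount.h>
#include <linux/completion.h>
#include <linux/container_of.h>
#include <linux/rbtree.h>
#include <linux/spinlock.h>
#include <linux/slab.h>

struct zswap_tree {
	struct rb_root rbroot;
	spinlock_t lock;
	struct percpu_ref ref;		/* pins the tree for writeback */
	struct completion ref_done;	/* swapoff waits for the last ref */
};

static void zswap_tree_release(struct percpu_ref *ref)
{
	struct zswap_tree *tree = container_of(ref, struct zswap_tree, ref);

	/* runs once every reference is gone after percpu_ref_kill() */
	complete(&tree->ref_done);
}

/* zswap_swapon(): give the tree its initial reference */
static int zswap_tree_init_ref(struct zswap_tree *tree)
{
	init_completion(&tree->ref_done);
	return percpu_ref_init(&tree->ref, zswap_tree_release, 0, GFP_KERNEL);
}

/* writeback: pin the tree before taking tree->lock; fails after kill */
static bool zswap_tree_tryget(struct zswap_tree *tree)
{
	return percpu_ref_tryget_live(&tree->ref);
}

static void zswap_tree_put(struct zswap_tree *tree)
{
	percpu_ref_put(&tree->ref);
}

/* zswap_swapoff(): stop new gets, wait out in-flight writeback, free */
static void zswap_tree_destroy(struct zswap_tree *tree)
{
	percpu_ref_kill(&tree->ref);
	wait_for_completion(&tree->ref_done);
	percpu_ref_exit(&tree->ref);
	kfree(tree);
}

Even this minimal version needs the release callback, a completion for
swapoff to wait on, and a tryget/put pair in the writeback path, which
is the kind of extra machinery I would rather avoid if the reordering
is enough.
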
> problem is that after the entry is isolated from the zswap LRU, we
> need to grab the tree lock to make sure it's still there and get a
> ref, and just trying to lock the tree may be a UAF if we race with
> swapoff.

I feel it is very wrong to have the tree freed while entries allocated
from that tree are still outstanding. I would want to avoid that
situation if possible.

> > Anyway, I think it is easier to reason about the behavior that way.
> > Swap off will take the extra hit, but not the normal access of the
> > tree.
>
> I think this adds a lot of unnecessary complexity tbh. I think the
> operations reordering that Chengming described may be the way to go
> here. It does not include adding more logic or synchronization

It does not require adding the tree reference count, right?

> primitives, just reordering operations to be closer to what we do in
> zswap_load() and leverage existing synchronization.

The complexity is mostly there to avoid the tree reference count.
If we don't add the tree refcount and we don't need the extra
complexity of the tree waiting for the LRU, that sounds great to me.

Chris