On Thu, Jan 25, 2024 at 2:31 PM Chris Li <chrisl@xxxxxxxxxx> wrote:
>
> Hi Yosry,
>
> On Thu, Jan 25, 2024 at 12:58 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >
> > On Thu, Jan 25, 2024 at 10:55 AM Chris Li <chrisl@xxxxxxxxxx> wrote:
> > >
> > > Hi Yosry,
> > >
> > > On Wed, Jan 24, 2024 at 11:59 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > >
> > > > On Wed, Jan 24, 2024 at 9:29 PM Chris Li <chriscli@xxxxxxxxxx> wrote:
> > > > >
> > > > > Hi Yosry,
> > > > >
> > > > > On Tue, Jan 23, 2024 at 10:58 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > >
> > > > > > > Thanks for the great analysis, I missed the swapoff/swapon race myself :)
> > > > > > >
> > > > > > > The first solution that came to mind for me was refcounting the zswap
> > > > > > > tree with RCU with percpu-refcount, similar to how cgroup refs are
> > > > > > > handled (init in zswap_swapon() and kill in zswap_swapoff()). I think
> > > > > > > the percpu-refcount may be overkill in terms of memory usage, though.
> > > > > > > I think we can still do our own refcounting with RCU, but it may be
> > > > > > > more complicated.
> > > > > >
> > > > > > FWIW, I was able to reproduce the problem in a VM with the following
> > > > > > kernel diff:
> > > > >
> > > > > Thanks for the great find.
> > > > >
> > > > > I was worried about the use-after-free situation in this email:
> > > > >
> > > > > https://lore.kernel.org/lkml/CAF8kJuOvOmn7wmKxoqpqSEk4gk63NtQG1Wc+Q0e9FZ9OFiUG6g@xxxxxxxxxxxxxx/
> > > > >
> > > > > Glad you were able to find a reproducible case. That is one of the
> > > > > reasons I changed the free to invalidate entries in my xarray patch.
> > > > >
> > > > > I think the swapoff code should remove the entry from the tree, then
> > > > > wait for each zswap entry's refcount to drop to zero before freeing it.
> > > >
> > > > This doesn't really help. The swapoff code is already removing all
> > > > the entries from the trees before zswap_swapoff() is called, through
> > > > zswap_invalidate(). The race I described occurs because the writeback
> > > > code is accessing the entries through the LRU, not the tree.
> > >
> > > Why? An entry not being in the tree is fine. What you describe means
> > > the swapoff code will not see the entry because it is already not in
> > > the tree, and the last holder of the reference count would free it.
> > >
> > > > The writeback code could have isolated a zswap entry from the LRU
> > > > before swapoff, then tried to access it after swapoff.
> > >
> > > The writeback code should hold a reference count on the zswap entry it
> > > isolates. The entry will not be freed until the LRU side is done and
> > > drops the reference count to zero.
> > >
> > > > Although the zswap entry itself is referenced and safe to use,
> > > > accessing the tree to grab the tree lock and check if the entry is
> > > > still in the tree is the problem.
> > >
> > > Swapoff should wait until all the LRU entries from that tree have been
> > > consumed before destroying the tree. Swapoff walks all the process MMs
> > > anyway; walking all the memcgs and finding all the zswap entries on the
> > > zswap LRUs should solve that problem.
> >
> > At that point, the entry is isolated from the zswap LRU list as well.
> > So even if swapoff iterates the zswap LRUs, it cannot find it to wait
> > for it.
>
> That just means we need to defer removing the entry from the LRU, only
> removing it after most of the writeback is complete, past the critical
> steps.
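To make the tree refcount idea quoted above a bit more concrete, here is a
rough sketch of the percpu-refcount variant. The struct layout and helper
names below are invented for illustration, this is not actual zswap code,
and safely finding the tree pointer itself would still need RCU (or
similar), which is omitted here:

#include <linux/percpu-refcount.h>
#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct zswap_tree {
	struct rb_root rbroot;
	spinlock_t lock;
	struct percpu_ref ref;	/* pins the tree itself */
};

static void zswap_tree_release(struct percpu_ref *ref)
{
	struct zswap_tree *tree = container_of(ref, struct zswap_tree, ref);

	/* exiting from the release callback is fine on recent kernels */
	percpu_ref_exit(&tree->ref);
	kfree(tree);
}

/* zswap_swapon(): hold the initial ref for the swapfile's lifetime */
static int zswap_tree_init_ref(struct zswap_tree *tree)
{
	return percpu_ref_init(&tree->ref, zswap_tree_release, 0, GFP_KERNEL);
}

/* zswap_swapoff(): drop the initial ref; the release callback runs once
 * every writeback-side ref has also been dropped. */
static void zswap_tree_kill_ref(struct zswap_tree *tree)
{
	percpu_ref_kill(&tree->ref);
}

/* writeback: pin the tree before taking tree->lock; fails once
 * zswap_swapoff() has killed the ref */
static bool zswap_tree_tryget(struct zswap_tree *tree)
{
	return percpu_ref_tryget_live(&tree->ref);
}

The per-CPU counters this allocates for every swapped-on device are the
memory overhead that may make percpu_ref overkill here.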
> > Take a closer look at the race condition I described. The problem is
> > that after the entry is isolated from the zswap LRU, we need to grab
> > the tree lock to make sure it's still there and get a ref, and just
> > trying to lock the tree may be a UAF if we race with swapoff.
>
> I took a closer look at the sequence Chengming described; it has the
> element of delaying the removal of the entry from the LRU as well.
>
> I feel it is very wrong to have the tree freed while there are still
> outstanding entries allocated from the tree. I would want to avoid that
> situation if possible.

This should be the case with Chengming's solution.

> > > Anyway, I think it is easier to reason about the behavior that way.
> > > Swapoff will take the extra hit, but not the normal access of the
> > > tree.
> >
> > I think this adds a lot of unnecessary complexity tbh. I think the
> > operations reordering that Chengming described may be the way to go
> > here. It does not involve adding more logic or synchronization
> > primitives, just reordering operations to be closer to what we do in
> > zswap_load() and leveraging existing synchronization.
>
> That does not require adding the tree reference count, right?

No, just reordering of operations in the writeback path.

> The complexity is mostly for avoiding the tree reference count. If we
> don't add the tree refcount and we don't need the extra complexity of
> the tree waiting for the LRU, that sounds great to me.

I think that's what Chengming's proposal provides.
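To illustrate the reordering, writeback would look roughly like the sketch
below. This reflects my understanding of the idea rather than the actual
patch: zswap_get_swapcache_folio() is a hypothetical stand-in for the
swapcache allocation step, and error handling is trimmed:

/* Sketch only, not the real zswap_writeback_entry() */
static int zswap_writeback_entry(struct zswap_entry *entry,
				 swp_entry_t swpentry)
{
	struct zswap_tree *tree;
	struct folio *folio;

	/*
	 * Step 1: put the folio into the swap cache *first*. The
	 * swapcache folio holds a reference on the swap entry, which
	 * keeps swapoff from completing, so the tree cannot be freed
	 * under us (the same thing that protects zswap_load()).
	 */
	folio = zswap_get_swapcache_folio(swpentry);	/* hypothetical */
	if (!folio)
		return -ENOMEM;

	/*
	 * Step 2: only now is it safe to take the tree lock. Recheck
	 * that the entry is still the one stored at this offset; it may
	 * have been invalidated or faulted in while it sat off-LRU.
	 */
	tree = zswap_trees[swp_type(swpentry)];
	spin_lock(&tree->lock);
	if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
		spin_unlock(&tree->lock);
		folio_put(folio);
		return -ENOENT;
	}
	zswap_entry_get(entry);		/* pin for the decompress step */
	spin_unlock(&tree->lock);

	/* Step 3: decompress into the folio and kick off the I/O
	 * (elided). */
	return 0;
}

The old order took tree->lock before anything pinned the swap device, and
that lock acquisition is exactly the possible UAF discussed above; here
the swapcache folio pins the swap entry before the tree is ever touched.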