On Tue, Jan 7, 2025 at 1:42 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Tue, Jan 7, 2025 at 12:16 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >
> > On Tue, Jan 7, 2025 at 10:13 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Jan 7, 2025 at 10:03 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > > >
> > > > On Tue, Jan 07, 2025 at 07:47:24AM +0000, Yosry Ahmed wrote:
> > > > > In zswap_compress() and zswap_decompress(), the per-CPU acomp_ctx of the
> > > > > current CPU at the beginning of the operation is retrieved and used
> > > > > throughout. However, since neither preemption nor migration are disabled,
> > > > > it is possible that the operation continues on a different CPU.
> > > > >
> > > > > If the original CPU is hotunplugged while the acomp_ctx is still in use,
> > > > > we run into a UAF bug as the resources attached to the acomp_ctx are freed
> > > > > during hotunplug in zswap_cpu_comp_dead().
> > > > >
> > > > > The problem was introduced in commit 1ec3b5fe6eec ("mm/zswap: move to use
> > > > > crypto_acomp API for hardware acceleration") when the switch to the
> > > > > crypto_acomp API was made. Prior to that, the per-CPU crypto_comp was
> > > > > retrieved using get_cpu_ptr() which disables preemption and makes sure the
> > > > > CPU cannot go away from under us. Preemption cannot be disabled with the
> > > > > crypto_acomp API as a sleepable context is needed.
> > > > >
> > > > > Commit 8ba2f844f050 ("mm/zswap: change per-cpu mutex and buffer to
> > > > > per-acomp_ctx") increased the UAF surface area by making the per-CPU
> > > > > buffers dynamic, adding yet another resource that can be freed from under
> > > > > zswap compression/decompression by CPU hotunplug.
> > > > >
> > > > > There are a few ways to fix this:
> > > > > (a) Add a refcount for acomp_ctx.
> > > > > (b) Disable migration while using the per-CPU acomp_ctx.
> > > > > (c) Use SRCU to wait for other CPUs using the acomp_ctx of the CPU being
> > > > >     hotunplugged. Normal RCU cannot be used as a sleepable context is
> > > > >     required.
> > > > >
> > > > > Implement (c) since it's simpler than (a), and (b) involves using
> > > > > migrate_disable() which is apparently undesired (see huge comment in
> > > > > include/linux/preempt.h).
> > > > >
> > > > > Fixes: 1ec3b5fe6eec ("mm/zswap: move to use crypto_acomp API for hardware acceleration")
> > > > > Cc: <stable@xxxxxxxxxxxxxxx>
> > > > > Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > > > > Reported-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> > > > > Closes: https://lore.kernel.org/lkml/20241113213007.GB1564047@xxxxxxxxxxx/
> > > > > Reported-by: Sam Sun <samsun1006219@xxxxxxxxx>
> > > > > Closes: https://lore.kernel.org/lkml/CAEkJfYMtSdM5HceNsXUDf5haghD5+o2e7Qv4OcuruL4tPg6OaQ@xxxxxxxxxxxxxx/
> > > > > ---
> > > > >  mm/zswap.c | 31 ++++++++++++++++++++++++++++---
> > > > >  1 file changed, 28 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > > index f6316b66fb236..add1406d693b8 100644
> > > > > --- a/mm/zswap.c
> > > > > +++ b/mm/zswap.c
> > > > > @@ -864,12 +864,22 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> > > > >  	return ret;
> > > > >  }
> > > > >
> > > > > +DEFINE_STATIC_SRCU(acomp_srcu);
> > > > > +
> > > > >  static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
> > > > >  {
> > > > >  	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> > > > >  	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> > > > >
> > > > >  	if (!IS_ERR_OR_NULL(acomp_ctx)) {
> > > > > +		/*
> > > > > +		 * Even though the acomp_ctx should not be currently in use on
> > > > > +		 * @cpu, it may still be used by compress/decompress operations
> > > > > +		 * that started on @cpu and migrated to a different CPU. Wait
> > > > > +		 * for such usages to complete, any new usages would be a bug.
> > > > > +		 */
> > > > > +		synchronize_srcu(&acomp_srcu);
> > > >
> > > > The docs suggest you can't solve it like that :(
> > > >
> > > > Documentation/RCU/Design/Requirements/Requirements.rst:
> > > >
> > > >   Also unlike other RCU flavors, synchronize_srcu() may **not** be
> > > >   invoked from CPU-hotplug notifiers, due to the fact that SRCU grace
> > > >   periods make use of timers and the possibility of timers being
> > > >   temporarily “stranded” on the outgoing CPU. This stranding of timers
> > > >   means that timers posted to the outgoing CPU will not fire until
> > > >   late in the CPU-hotplug process. The problem is that if a notifier
> > > >   is waiting on an SRCU grace period, that grace period is waiting on
> > > >   a timer, and that timer is stranded on the outgoing CPU, then the
> > > >   notifier will never be awakened, in other words, deadlock has
> > > >   occurred. This same situation of course also prohibits
> > > >   srcu_barrier() from being invoked from CPU-hotplug notifiers.
> > >
> > > Thanks for checking, I completely missed this. I guess it only works
> > > with SRCU if we use call_srcu(), but then we need to copy the pointers
> > > to a new struct to avoid racing with the CPU getting onlined again.
> > > Otherwise we can just bite the bullet and add a refcount, or use
> > > migrate_disable() despite that being undesirable.
> > >
> > > Do you have a favorite? :)
> >
> > I briefly looked into refcounting. The annoying thing is that we need
> > to handle the race between putting the last refcount in
> > zswap_compress()/zswap_decompress(), and the CPU getting onlined again
> > and re-initializing the refcount. One way to do it would be to put all
> > dynamically allocated resources in the same struct as the new
> > refcount, and use RCU + refcounts to allocate and free the struct as
> > a whole.
> >
> > I am leaning toward just disabling migration for now tbh unless there
> > are objections to that, especially this close to the v6.13 release.
>
> (Sorry for going back and forth on this, I am essentially thinking out loud)
>
> Actually, as Kanchana mentioned before, we should be able to just hold
> the mutex in zswap_cpu_comp_dead() before freeing the dynamic
> resources. The mutex is allocated when the pool is created and will
> not go away during CPU hotunplug AFAICT. It confused me before because
> we call mutex_init() in zswap_cpu_comp_prepare(), but it really should
> be in zswap_pool_create() after we allocate the pool->acomp_ctx.

Nope. It's possible for zswap_cpu_comp_dead() to hold the mutex and
free the resources after zswap_[de]compress() calls raw_cpu_ptr() but
before it calls mutex_lock().
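
To make that window concrete, here is a minimal userspace sketch (plain C
with pthreads, not kernel code) that replays the three steps in order on a
single thread. The names struct ctx, lookup_ctx(), unplug_ctx() and
race_window_observed() are hypothetical stand-ins for the per-CPU acomp_ctx,
raw_cpu_ptr(), zswap_cpu_comp_dead() and the compress path; none of them
exist in mm/zswap.c. The model also sets the buffer pointer to NULL after
freeing it, purely so the stale state can be observed safely, which is an
assumption of the sketch and not something the patch above does.

```c
/* Userspace model of the lookup-then-lock window; NOT kernel code. */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct ctx {
	pthread_mutex_t mutex;	/* outlives "hotunplug", like the pool's mutex */
	char *buffer;		/* dynamic resource freed on "hotunplug" */
};

/* Hypothetical stand-in for raw_cpu_ptr(): returns the pointer, no locking. */
static struct ctx *lookup_ctx(struct ctx *c)
{
	return c;
}

/* Hypothetical stand-in for zswap_cpu_comp_dead() freeing under the mutex. */
static void unplug_ctx(struct ctx *c)
{
	pthread_mutex_lock(&c->mutex);
	free(c->buffer);
	c->buffer = NULL;	/* model-only: makes the freed state observable */
	pthread_mutex_unlock(&c->mutex);
}

/*
 * Replays the interleaving sequentially:
 *   1. the compress path looks up the ctx (no lock held yet)
 *   2. hotunplug takes the mutex, frees the resources, drops the mutex
 *   3. the compress path takes the mutex and finds the resources gone
 * Returns true if step 3 saw the resources already freed.
 */
static bool race_window_observed(void)
{
	struct ctx c;
	struct ctx *seen;
	bool gone;

	pthread_mutex_init(&c.mutex, NULL);
	c.buffer = malloc(64);
	assert(c.buffer != NULL);

	seen = lookup_ctx(&c);			/* step 1 */
	unplug_ctx(&c);				/* step 2 */

	pthread_mutex_lock(&seen->mutex);	/* step 3 */
	gone = (seen->buffer == NULL);
	pthread_mutex_unlock(&seen->mutex);

	pthread_mutex_destroy(&c.mutex);
	return gone;
}
```

In this model the mutex is taken correctly on both sides, yet the compress
path still ends up holding a ctx whose resources are gone, because the
lookup in step 1 is not covered by the lock. In the kernel there is no NULL
marker to check in the patch as posted, so step 3 would go on to use freed
memory.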