On Wed, Jan 8, 2025 at 4:12 PM Sridhar, Kanchana P
<kanchana.p.sridhar@xxxxxxxxx> wrote:
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > Sent: Wednesday, January 8, 2025 2:25 PM
> > To: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > Cc: Johannes Weiner <hannes@xxxxxxxxxxx>; Nhat Pham <nphamcs@xxxxxxxxx>;
> > Chengming Zhou <chengming.zhou@xxxxxxxxx>; Vitaly Wool <vitalywool@xxxxxxxxx>;
> > Barry Song <baohua@xxxxxxxxxx>; Sam Sun <samsun1006219@xxxxxxxxx>;
> > Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>; linux-mm@xxxxxxxxx;
> > linux-kernel@xxxxxxxxxxxxxxx; Yosry Ahmed <yosryahmed@xxxxxxxxxx>;
> > stable@xxxxxxxxxxxxxxx
> > Subject: [PATCH v2] mm: zswap: properly synchronize freeing resources
> > during CPU hotunplug
> >
> > In zswap_compress() and zswap_decompress(), the per-CPU acomp_ctx of the
> > current CPU at the beginning of the operation is retrieved and used
> > throughout. However, since neither preemption nor migration are
> > disabled, it is possible that the operation continues on a different
> > CPU.
> >
> > If the original CPU is hotunplugged while the acomp_ctx is still in use,
> > we run into a UAF bug as some of the resources attached to the acomp_ctx
> > are freed during hotunplug in zswap_cpu_comp_dead() (i.e.
> > acomp_ctx.buffer, acomp_ctx.req, or acomp_ctx.acomp).
> >
> > The problem was introduced in commit 1ec3b5fe6eec ("mm/zswap: move to
> > use crypto_acomp API for hardware acceleration") when the switch to the
> > crypto_acomp API was made. Prior to that, the per-CPU crypto_comp was
> > retrieved using get_cpu_ptr() which disables preemption and makes sure
> > the CPU cannot go away from under us. Preemption cannot be disabled
> > with the crypto_acomp API as a sleepable context is needed.
> >
> > Use the acomp_ctx.mutex to synchronize CPU hotplug callbacks allocating
> > and freeing resources with compression/decompression paths.
> > Make sure
> > that acomp_ctx.req is NULL when the resources are freed. In the
> > compression/decompression paths, check if acomp_ctx.req is NULL after
> > acquiring the mutex (meaning the CPU was offlined) and retry on the new
> > CPU.
> >
> > The initialization of acomp_ctx.mutex is moved from the CPU hotplug
> > callback to the pool initialization where it belongs (where the mutex is
> > allocated). In addition to adding clarity, this makes sure that CPU
> > hotplug cannot reinitialize a mutex that is already locked by
> > compression/decompression.
> >
> > Previously a fix was attempted by holding cpus_read_lock() [1]. This
> > would have caused a potential deadlock as it is possible for code
> > already holding the lock to fall into reclaim and enter zswap (causing a
> > deadlock). A fix was also attempted using SRCU for synchronization, but
> > Johannes pointed out that synchronize_srcu() cannot be used in CPU
> > hotplug notifiers [2].
> >
> > Alternative fixes that were considered/attempted and could have worked:
> > - Refcounting the per-CPU acomp_ctx. This involves complexity in
> >   handling the race between the refcount dropping to zero in
> >   zswap_[de]compress() and the refcount being re-initialized when the
> >   CPU is onlined.
> > - Disabling migration before getting the per-CPU acomp_ctx [3], but
> >   that's discouraged and is a much bigger hammer than needed, and could
> >   result in subtle performance issues.
> >
> > [1] https://lkml.kernel.org/20241219212437.2714151-1-yosryahmed@xxxxxxxxxx/
> > [2] https://lkml.kernel.org/20250107074724.1756696-2-yosryahmed@xxxxxxxxxx/
> > [3] https://lkml.kernel.org/20250107222236.2715883-2-yosryahmed@xxxxxxxxxx/
> >
> > Fixes: 1ec3b5fe6eec ("mm/zswap: move to use crypto_acomp API for
> > hardware acceleration")
> > Cc: <stable@xxxxxxxxxxxxxxx>
> > Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > Reported-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> > Closes: https://lore.kernel.org/lkml/20241113213007.GB1564047@xxxxxxxxxxx/
> > Reported-by: Sam Sun <samsun1006219@xxxxxxxxx>
> > Closes: https://lore.kernel.org/lkml/CAEkJfYMtSdM5HceNsXUDf5haghD5+o2e7Qv4OcuruL4tPg6OaQ@xxxxxxxxxxxxxx/
> > ---
> >
> > This applies on top of the latest mm-hotfixes-unstable on top of 'Revert
> > "mm: zswap: fix race between [de]compression and CPU hotunplug"' and
> > after 'mm: zswap: disable migration while using per-CPU acomp_ctx' was
> > dropped.
> >
> > v1 -> v2:
> > - Move the initialization of the mutex to pool initialization.
> > - Use the mutex to also synchronize with the CPU hotplug callback (i.e.
> >   zswap_cpu_comp_prep()).
> > - Naming cleanups.
> >
> > ---
> >  mm/zswap.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++++--------------
> >  1 file changed, 46 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index f6316b66fb236..4d7e564732267 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -251,7 +251,7 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
> >  	struct zswap_pool *pool;
> >  	char name[38]; /* 'zswap' + 32 char (max) num + \0 */
> >  	gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
> > -	int ret;
> > +	int ret, cpu;
> >
> >  	if (!zswap_has_pool) {
> >  		/* if either are unset, pool initialization failed, and we
> > @@ -285,6 +285,9 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
> >  		goto error;
> >  	}
> >
> > +	for_each_possible_cpu(cpu)
> > +		mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
> > +
> >  	ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> >  				       &pool->node);
> >  	if (ret)
> > @@ -821,11 +824,12 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> >  	struct acomp_req *req;
> >  	int ret;
> >
> > -	mutex_init(&acomp_ctx->mutex);
> > -
> > +	mutex_lock(&acomp_ctx->mutex);
> >  	acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> > -	if (!acomp_ctx->buffer)
> > -		return -ENOMEM;
> > +	if (!acomp_ctx->buffer) {
> > +		ret = -ENOMEM;
> > +		goto buffer_fail;
> > +	}
> >
> >  	acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
> >  	if (IS_ERR(acomp)) {
> > @@ -844,6 +848,8 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> >  		ret = -ENOMEM;
> >  		goto req_fail;
> >  	}
> > +
> > +	/* acomp_ctx->req must be NULL if the acomp_ctx is not fully initialized */
> >  	acomp_ctx->req = req;
>
> For this to happen, shouldn't we directly assign:
>	acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
>	if (!acomp_ctx->req) { ... }

For the initial
zswap_cpu_comp_prepare() call on a CPU, it doesn't matter because a
failure will cause a failure of initialization or CPU onlining. For
calls after a CPU is offlined, zswap_cpu_comp_dead() will have set
acomp_ctx->req to NULL, so leaving it unmodified keeps it as NULL.

Although I suppose the comment is not really clear, because
zswap_cpu_comp_prepare() could fail before acomp_request_alloc() is
ever called, and this would not be a problem (as the onlining will
fail). Maybe it's best to just drop the comment.

Andrew, could you please fold this in?

diff --git a/mm/zswap.c b/mm/zswap.c
index 4d7e564732267..30f5a27a68620 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -848,8 +848,6 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 		ret = -ENOMEM;
 		goto req_fail;
 	}
-
-	/* acomp_ctx->req must be NULL if the acomp_ctx is not fully initialized */
 	acomp_ctx->req = req;
 	crypto_init_wait(&acomp_ctx->wait);

> I was wondering how error conditions encountered in
> zswap_cpu_comp_prepare() will impact zswap_[de]compress(). This is
> probably unrelated to this patch itself, but is my understanding
> correct that an error in this procedure will cause zswap_enabled to be
> set to false, which will cause any zswap_stores() to fail early?

Yeah, that is my understanding, for the initial call. For later calls
when a CPU gets onlined, I think the CPU onlining fails and the
acomp_ctx is not used.