RE: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage.

"Sridhar, Kanchana P" <kanchana.p.sridhar@xxxxxxxxx> · Sat, 8 Mar 2025 02:47:15 +0000

> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@xxxxxxxxx>
> Sent: Friday, March 7, 2025 11:30 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> hannes@xxxxxxxxxxx; nphamcs@xxxxxxxxx; chengming.zhou@xxxxxxxxx;
> usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx;
> ying.huang@xxxxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; linux-
> crypto@xxxxxxxxxxxxxxx; herbert@xxxxxxxxxxxxxxxxxxx;
> davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx; ardb@xxxxxxxxxx;
> ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi, Kristen C
> <kristen.c.accardi@xxxxxxxxx>; Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource
> allocation/deletion and mutex lock usage.
> 
> On Fri, Mar 07, 2025 at 12:01:14AM +0000, Sridhar, Kanchana P wrote:
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosry.ahmed@xxxxxxxxx>
> > > Sent: Thursday, March 6, 2025 11:36 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> > > Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> > > hannes@xxxxxxxxxxx; nphamcs@xxxxxxxxx;
> chengming.zhou@xxxxxxxxx;
> > > usamaarif642@xxxxxxxxx; ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx;
> > > ying.huang@xxxxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; linux-
> > > crypto@xxxxxxxxxxxxxxx; herbert@xxxxxxxxxxxxxxxxxxx;
> > > davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx; ardb@xxxxxxxxxx;
> > > ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi, Kristen C
> > > <kristen.c.accardi@xxxxxxxxx>; Feghali, Wajdi K
> <wajdi.k.feghali@xxxxxxxxx>;
> > > Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> > > Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource
> > > allocation/deletion and mutex lock usage.
> > >
> > > On Mon, Mar 03, 2025 at 12:47:22AM -0800, Kanchana P Sridhar wrote:
> > > > This patch modifies the acomp_ctx resources' lifetime to be from pool
> > > > creation to deletion. A "bool __online" and "u8 nr_reqs" are added to
> > > > "struct crypto_acomp_ctx" which simplify a few things:
> > > >
> > > > 1) zswap_pool_create() will initialize all members of each percpu
> > > >    acomp_ctx to 0 or NULL and only then initialize the mutex.
> > > > 2) CPU hotplug will set nr_reqs to 1, allocate resources and set __online
> > > >    to true, without locking the mutex.
> > > > 3) CPU hotunplug will lock the mutex before setting __online to false. It
> > > >    will not delete any resources.
> > > > 4) acomp_ctx_get_cpu_lock() will lock the mutex, then check if __online
> > > >    is true, and if so, return the mutex for use in zswap compress and
> > > >    decompress ops.
> > > > 5) CPU onlining after offlining will simply check if either __online or
> > > >    nr_reqs are non-0, and return 0 if so, without re-allocating the
> > > >    resources.
> > > > 6) zswap_pool_destroy() will call a newly added
> > > >    zswap_cpu_comp_dealloc() to delete the acomp_ctx resources.
> > > > 7) Common resource deletion code in case of zswap_cpu_comp_prepare()
> > > >    errors, and for use in zswap_cpu_comp_dealloc(), is factored into a
> > > >    new acomp_ctx_dealloc().
> > > >
> > > > The CPU hot[un]plug callback functions are moved to "pool functions"
> > > > accordingly.
> > > >
> > > > The per-cpu memory cost of not deleting the acomp_ctx resources upon
> > > > CPU offlining, and only deleting them when the pool is destroyed, is as
> > > > follows:
> > > >
> > > >     IAA with batching: 64.8 KB
> > > >     Software compressors: 8.2 KB
> > > >
> > > > I would appreciate code review comments on whether this memory cost
> > > > is acceptable, for the latency improvement that it provides due to a
> > > > faster reclaim restart after a CPU hotunplug-hotplug sequence - all that
> > > > the hotplug code needs to do is to check if acomp_ctx->nr_reqs is non-0,
> > > > and if so, set __online to true and return, and reclaim can proceed.
> > >
> > > I like the idea of allocating the resources on CPU hotplug but
> > > leaving them allocated until the pool is torn down. It avoids allocating
> > > unnecessary memory if some CPUs are never onlined, but it simplifies
> > > things because we don't have to synchronize against the resources being
> > > freed in CPU offline.
> > >
> > > The only case that would suffer from this AFAICT is if someone onlines
> > > many CPUs, uses them once, and then offlines them and does not use them
> > > again.
> > > I am not familiar with CPU hotplug use cases so I can't tell if that's
> > > something people do, but I am inclined to agree with this
> > > simplification.
> >
> > Thanks Yosry, for your code review comments! Good to know that this
> > simplification is acceptable.
> >
> > >
> > > >
> > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@xxxxxxxxx>
> > > > ---
> > > >  mm/zswap.c | 273 +++++++++++++++++++++++++++++++++++------------------
> > > >  1 file changed, 182 insertions(+), 91 deletions(-)
> > > >
> > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > index 10f2a16e7586..cff96df1df8b 100644
> > > > --- a/mm/zswap.c
> > > > +++ b/mm/zswap.c
> > > > @@ -144,10 +144,12 @@ bool zswap_never_enabled(void)
> > > >  struct crypto_acomp_ctx {
> > > >  	struct crypto_acomp *acomp;
> > > >  	struct acomp_req *req;
> > > > -	struct crypto_wait wait;
> > >
> > > Is there a reason for moving this? If not please avoid unrelated changes.
> >
> > The reason is so that req/buffer, and reqs/buffers with batching, go together
> > logically, hence I found this easier to understand. I can restore this to the
> > original order, if that's preferable.
> 
> I see. In that case, this fits better in the patch that actually adds
> support for having multiple requests and buffers, and please call it out
> explicitly in the commit message.

Thanks Yosry, for the follow up comments. Sure, this makes sense.

> 
> >
> > >
> > > >  	u8 *buffer;
> > > > +	u8 nr_reqs;
> > > > +	struct crypto_wait wait;
> > > >  	struct mutex mutex;
> > > >  	bool is_sleepable;
> > > > +	bool __online;
> > >
> > > I don't believe we need this.
> > >
> > > If we are not freeing resources during CPU offlining, then we do not
> > > need a CPU offline callback and acomp_ctx->__online serves no purpose.
> > >
> > > The whole point of synchronizing between offlining and
> > > compress/decompress operations is to avoid UAF. If offlining does not
> > > free resources, then we can hold the mutex directly in the
> > > compress/decompress path and drop the hotunplug callback completely.
> > >
> > > I also believe nr_reqs can be dropped from this patch, as it seems like
> > > it's only used to know when to set __online.
> >
> > All great points! In fact, that was the original solution I had implemented
> > (not having an offline callback). But then, I spent some time understanding
> > the v6.13 hotfix for synchronizing freeing of resources, and this comment
> > in zswap_cpu_comp_prepare():
> >
> > 	/*
> > 	 * Only hold the mutex after completing allocations, otherwise we may
> > 	 * recurse into zswap through reclaim and attempt to hold the mutex
> > 	 * again resulting in a deadlock.
> > 	 */
> >
> > Hence, I figured the constraint of "recurse into zswap through reclaim" was
> > something to comprehend in the simplification (even though I had a tough
> > time imagining how this could happen).
> 
> The constraint here is about zswap_cpu_comp_prepare() holding the mutex,
> making an allocation which internally triggers reclaim, then recursing
> into zswap and trying to hold the same mutex again causing a deadlock.
> 
> If zswap_cpu_comp_prepare() does not need to hold the mutex to begin
> with, the constraint naturally goes away.

Actually, if it is possible for the allocations in zswap_cpu_comp_prepare()
to trigger reclaim, then I believe we need some way for reclaim to know if
the acomp_ctx resources are available. Hence, this looks like a potential
deadlock regardless of the mutex.

I verified that all the zswap_cpu_comp_prepare() allocations are done with
GFP_KERNEL, which implicitly allows direct reclaim. So this appears to be a
risk for deadlock between zswap_compress() and zswap_cpu_comp_prepare()
in general, i.e., aside from this patchset.
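
To make the recursion concrete, here is a minimal single-threaded userspace
model in C. It is not kernel code, and all model_* names are hypothetical.
The toy lock returns -EDEADLK where a real kernel mutex would hang, and the
-EAGAIN for "resources not allocated yet" is an assumption made for
illustration only. The sketch also assumes that the allocation in the
prepare step always enters direct reclaim and recurses into the compress
path on the same context.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy lock: reports -EDEADLK where a non-recursive kernel mutex would hang. */
struct model_mutex {
	bool held;
};

static int model_mutex_lock(struct model_mutex *m)
{
	if (m->held)
		return -EDEADLK;
	m->held = true;
	return 0;
}

static void model_mutex_unlock(struct model_mutex *m)
{
	m->held = false;
}

/* Hypothetical stand-in for the per-CPU acomp_ctx. */
struct model_ctx {
	struct model_mutex mutex;
	void *buffer;
};

/* Models zswap_compress() entered from direct reclaim on the same ctx. */
static int model_reclaim_compress(struct model_ctx *ctx)
{
	int err = model_mutex_lock(&ctx->mutex);

	if (err)
		return err;
	/* Lock acquired, but the resources may not exist yet. */
	err = ctx->buffer ? 0 : -EAGAIN;
	model_mutex_unlock(&ctx->mutex);
	return err;
}

/*
 * Models zswap_cpu_comp_prepare(). Returns what the recursive compress
 * call observed while the allocation was in flight.
 */
static int model_prepare(struct model_ctx *ctx, bool hold_mutex_while_allocating)
{
	void *buffer;
	int reclaim_err;

	if (hold_mutex_while_allocating)
		model_mutex_lock(&ctx->mutex);

	/* Assumed side effect of a GFP_KERNEL allocation under pressure. */
	reclaim_err = model_reclaim_compress(ctx);
	buffer = malloc(4096);

	if (!hold_mutex_while_allocating)
		model_mutex_lock(&ctx->mutex);
	free(ctx->buffer);
	ctx->buffer = buffer;
	model_mutex_unlock(&ctx->mutex);

	return reclaim_err;
}
```

In this model, allocating under the mutex yields -EDEADLK, while allocating
before taking the mutex avoids the deadlock but lets the recursive compress
call see a context whose buffer is still NULL.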

I can think of the following options to resolve this, and would welcome
other suggestions:

1) Less intrusive: acomp_ctx_get_cpu_lock() should get the mutex, check
    if acomp_ctx->__online is true, and if so, return the acomp_ctx with
    the mutex held. If acomp_ctx->__online is false, then it returns NULL.
    In other words, we don't have the for loop.
    - This will cause recursions into direct reclaim from
       zswap_cpu_comp_prepare() to fail, and hence cpuhotplug to fail.
       However, there is no deadlock.
        - zswap_compress() will need to detect NULL returned by
          acomp_ctx_get_cpu_lock(), and return an error.
        - zswap_decompress() will need a BUG_ON(!acomp_ctx) after calling
          acomp_ctx_get_cpu_lock().
    - We won't be migrated to a different CPU because we hold the mutex, hence
      zswap_cpu_comp_dead() will wait on the mutex.

2) More intrusive: We would need to use a gfp_t that prevents direct reclaim
    and kswapd, i.e., something similar to GFP_TRANSHUGE_LIGHT in gfp_types.h,
    but for non-THP allocations. If we decide to adopt this approach, we would
    need changes in include/crypto/acompress.h, crypto/api.c, and crypto/acompress.c
    to allow crypto_create_tfm_node() to call crypto_alloc_tfmmem() with this
    new gfp_t, in lieu of GFP_KERNEL.
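
Option 1 above can be sketched as follows. This is a single-threaded
userspace model in C, not kernel code. All model_* names are hypothetical,
the toy lock never blocks (a real mutex_lock() would sleep until the mutex
is free), and -ENODEV is an arbitrary error code chosen for illustration.
Only the compress side is modelled; the BUG_ON() for decompress is left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock for a single-threaded model. */
struct model_mutex {
	bool held;
};

static void model_mutex_lock(struct model_mutex *m)
{
	m->held = true;
}

static void model_mutex_unlock(struct model_mutex *m)
{
	m->held = false;
}

/* Hypothetical stand-in for the per-CPU acomp_ctx. */
struct model_ctx {
	struct model_mutex mutex;
	bool online;	/* models acomp_ctx->__online */
};

/*
 * Option 1: take the mutex, check the flag once, and return NULL instead
 * of looping. On success the ctx is returned with the mutex held.
 */
static struct model_ctx *model_get_cpu_lock(struct model_ctx *ctx)
{
	model_mutex_lock(&ctx->mutex);
	if (ctx->online)
		return ctx;
	model_mutex_unlock(&ctx->mutex);
	return NULL;
}

static void model_put_unlock(struct model_ctx *ctx)
{
	model_mutex_unlock(&ctx->mutex);
}

/* Compress path: a NULL ctx becomes an error the caller can act on. */
static int model_compress(struct model_ctx *ctx)
{
	struct model_ctx *locked = model_get_cpu_lock(ctx);

	if (!locked)
		return -ENODEV;
	/* ... compression using the ctx resources would happen here ... */
	model_put_unlock(locked);
	return 0;
}
```

With this shape, a store that races with onlining fails fast and can fall
back to the backing swap device, rather than spinning in a retry loop.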

> 
> >
> > Hence, I added the "bool __online" because zswap_cpu_comp_prepare()
> > does not acquire the mutex lock while allocating resources. We have
> > already initialized the mutex, so in theory, it is possible for
> > compress/decompress to acquire the mutex lock. The __online acts as a
> > way to indicate whether compress/decompress can proceed reliably to use
> > the resources.
> 
> For compress/decompress to acquire the mutex they need to run on that
> CPU, and I don't think that's possible before onlining completes, so
> zswap_cpu_comp_prepare() must have already completed before
> compress/decompress can use that CPU IIUC.

If we can make this assumption, that would be great! However, I am not
totally sure because of the GFP_KERNEL allocations in
zswap_cpu_comp_prepare().

> 
> >
> > The "nr_reqs" was needed as a way to distinguish between initial and
> > subsequent calls into zswap_cpu_comp_prepare(), e.g., on a CPU that
> > goes through an online-offline-online sequence. In the initial onlining,
> > we need to allocate resources because nr_reqs=0. If resources are to
> > be allocated, we set acomp_ctx->nr_reqs and proceed to allocate
> > reqs/buffers/etc. In the subsequent onlining, we can quickly inspect
> > nr_reqs as being greater than 0 and return, thus avoiding any added
> > latency before reclaim/page-faults can be handled on that CPU.
> >
> > Please let me know if this rationale seems reasonable for why
> > __online and nr_reqs were introduced.
> 
> Based on what I said, I still don't believe they are needed, but please
> correct me if I am wrong.

Same comments as above. 

> 
> [..]
> > > I also see some ordering changes inside the function (e.g. we now
> > > allocate the request before the buffer). Not sure if these are
> > > intentional. If not, please keep the diff to the required changes only.
> >
> > The reason for this was, I am trying to organize the allocations based
> > on dependencies. Unless requests are allocated, there is no point in
> > allocating buffers. Please let me know if this is Ok.
> 
> Please separate refactoring changes in general from functional changes
> because it makes code review harder.

Sure, I will do so.

> 
> In this specific instance, I think moving the code is probably not worth
> it, as there's also no point in allocating requests if we cannot
> allocate buffers. In fact, since the buffers are larger, in theory their
> allocation is more likely to fail, so it makes sense to do it first.

Understood, makes better sense than allocating the requests first.

> 
> Anyway, please propose such refactoring changes separately and they can
> be discussed as such.

Ok.

> 
> [..]
> > > > +static void zswap_cpu_comp_dealloc(unsigned int cpu, struct hlist_node *node)
> > > > +{
> > > > +	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> > > > +	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> > > > +
> > > > +	/*
> > > > +	 * The lifetime of acomp_ctx resources is from pool creation to
> > > > +	 * pool deletion.
> > > > +	 *
> > > > +	 * Reclaims should not be happening because we get to this routine
> > > > +	 * only in two scenarios:
> > > > +	 *
> > > > +	 * 1) pool creation failures before/during the pool ref initialization.
> > > > +	 * 2) we are in the process of releasing the pool, it is off the
> > > > +	 *    zswap_pools list and has no references.
> > > > +	 *
> > > > +	 * Hence, there is no need for locks.
> > > > +	 */
> > > > +	acomp_ctx->__online = false;
> > > > +	acomp_ctx_dealloc(acomp_ctx);
> > >
> > > Since __online can be dropped, we can probably drop
> > > zswap_cpu_comp_dealloc() and call acomp_ctx_dealloc() directly?
> >
> > I suppose there is value in having a way in zswap to know for sure, that
> > resource allocation has completed, and it is safe for compress/decompress
> > to proceed. Especially because the mutex has been initialized before we
> > get to resource allocation. Would you agree?
> 
> As I mentioned above, I believe compress/decompress cannot run on a CPU
> before the onlining completes. Please correct me if I am wrong.
> 
> >
> > >
> > > > +}
> > > > +
> > > >  static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
> > > >  {
> > > >  	struct zswap_pool *pool;
> > > > @@ -285,13 +403,21 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
> > > >  		goto error;
> > > >  	}
> > > >
> > > > -	for_each_possible_cpu(cpu)
> > > > -		mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
> > > > +	for_each_possible_cpu(cpu) {
> > > > +		struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> > > > +
> > > > +		acomp_ctx->acomp = NULL;
> > > > +		acomp_ctx->req = NULL;
> > > > +		acomp_ctx->buffer = NULL;
> > > > +		acomp_ctx->__online = false;
> > > > +		acomp_ctx->nr_reqs = 0;
> > >
> > > Why is this needed? Wouldn't zswap_cpu_comp_prepare() initialize them
> > > right away?
> >
> > Yes, I figured this is needed for two reasons:
> >
> > 1) For the error handling in zswap_cpu_comp_prepare() and calls into
> >     zswap_cpu_comp_dealloc() to be handled by the common procedure
> >     "acomp_ctx_dealloc()" unambiguously.
> 
> This makes sense. When you move the refactoring to create
> acomp_ctx_dealloc() to a separate patch, please include this change in
> it and call it out explicitly in the commit message.

Sure.

> 
> > 2) The second scenario I thought of that would need this, is let's say
> >      the zswap compressor is switched immediately after setting the
> >      compressor. Some cores have executed the onlining code and
> >      some haven't. Because there are no pool refs held,
> >      zswap_cpu_comp_dealloc() would be called per-CPU. Hence, I figured
> >      it would help to initialize these acomp_ctx members before the
> >      hand-off to "cpuhp_state_add_instance()" in zswap_pool_create().
> 
> I believe cpuhp_state_add_instance() calls the onlining function
> synchronously on all present CPUs, so I don't think it's possible to end
> up in a state where the pool is being destroyed and some CPU executed
> zswap_cpu_comp_prepare() while others haven't.

I looked at the cpuhotplug code some more. The startup callback is
invoked sequentially for_each_present_cpu(). If an error occurs for any
one of them, it calls the teardown callback only on the earlier cores that
have already finished running the startup callback. However,
zswap_cpu_comp_dealloc() will be called for all cores, even the ones
for which the startup callback was not run. Hence, I believe the
zero initialization is useful, although it can be done by allocating the
acomp_ctx with alloc_percpu_gfp() and __GFP_ZERO.
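
The error path described above can be modelled in userspace C as follows.
This is only a sketch: all model_* names are hypothetical, calloc() stands
in for a zeroing per-CPU allocation, and the fail_cpu parameter is an
artificial way to inject a startup-callback failure on one CPU.

```c
#include <errno.h>
#include <stdlib.h>

#define MODEL_NR_CPUS 4

/* Hypothetical stand-in for the per-CPU acomp_ctx. */
struct model_ctx {
	void *req;
	void *buffer;
};

/* Safe on a zeroed ctx because free(NULL) is a no-op. */
static void model_ctx_dealloc(struct model_ctx *ctx)
{
	free(ctx->req);
	free(ctx->buffer);
	ctx->req = NULL;
	ctx->buffer = NULL;
}

/* Models the startup callback for one CPU. */
static int model_prepare(struct model_ctx *ctx, int cpu, int fail_cpu)
{
	if (cpu == fail_cpu)
		return -ENOMEM;	/* injected failure */
	ctx->buffer = malloc(64);
	ctx->req = malloc(64);
	if (!ctx->buffer || !ctx->req) {
		model_ctx_dealloc(ctx);
		return -ENOMEM;
	}
	return 0;
}

/*
 * Models pool creation: zeroed contexts, sequential startup callbacks,
 * and an unconditional dealloc on every CPU if any callback fails.
 */
static int model_pool_create(struct model_ctx **out, int fail_cpu)
{
	struct model_ctx *ctxs = calloc(MODEL_NR_CPUS, sizeof(*ctxs));
	int cpu, err = 0;

	*out = NULL;
	if (!ctxs)
		return -ENOMEM;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		err = model_prepare(&ctxs[cpu], cpu, fail_cpu);
		if (err)
			break;
	}
	if (!err) {
		*out = ctxs;
		return 0;
	}

	/* CPUs at or after the failing one never ran the startup callback. */
	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		model_ctx_dealloc(&ctxs[cpu]);
	free(ctxs);
	return err;
}

static void model_pool_destroy(struct model_ctx *ctxs)
{
	int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		model_ctx_dealloc(&ctxs[cpu]);
	free(ctxs);
}
```

Without the zeroing allocation, the dealloc loop in the error path would
read uninitialized pointers for the CPUs that were never prepared.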

> 
> That being said, this made me think of a different problem. If pool
> destruction races with CPU onlining, there could be a race between
> zswap_cpu_comp_prepare() allocating resources and
> zswap_cpu_comp_dealloc() (or acomp_ctx_dealloc()) freeing them.
> 
> I believe we must always call cpuhp_state_remove_instance() *before*
> freeing the resources to prevent this race from happening. This needs to
> be documented with a comment.

Yes, this race condition is possible, thanks for catching this! The problem with
calling cpuhp_state_remove_instance() before freeing the resources is that
cpuhp_state_add_instance() and cpuhp_state_remove_instance() both
acquire cpuhp_state_mutex at the beginning, and hence are serialized.

Given the reasons for setting acomp_ctx->__online to false in
zswap_cpu_comp_dead(), I cannot call cpuhp_state_remove_instance()
before calling acomp_ctx_dealloc(), because the latter could wait for
acomp_ctx->__online to be true before deleting the resources. I will
think about this some more.

Another possibility is to not rely on cpuhotplug in zswap, and instead
manage the per-cpu acomp_ctx resource allocation entirely in
zswap_pool_create(), and deletion entirely in zswap_pool_destroy(),
along with the necessary error handling. Let me think about this some
more as well.

> 
> Let me know if I missed something.
> 
> >
> > Please let me know if these are valid considerations.
> >
> > >
> > > If it is in fact needed we should probably just use __GFP_ZERO.
> >
> > Sure. Are you suggesting I use "alloc_percpu_gfp()" instead of
> > "alloc_percpu()" for the acomp_ctx?
> 
> Yeah if we need to initialize all/most fields to 0 let's use
> alloc_percpu_gfp() and pass GFP_KERNEL | __GFP_ZERO.

Sounds good.

> 
> [..]
> > > > @@ -902,16 +957,52 @@ static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
> > > >
> > > >  	for (;;) {
> > > >  		acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> > > > -		mutex_lock(&acomp_ctx->mutex);
> > > > -		if (likely(acomp_ctx->req))
> > > > -			return acomp_ctx;
> > > >  		/*
> > > > -		 * It is possible that we were migrated to a different CPU after
> > > > -		 * getting the per-CPU ctx but before the mutex was acquired. If
> > > > -		 * the old CPU got offlined, zswap_cpu_comp_dead() could have
> > > > -		 * already freed ctx->req (among other things) and set it to
> > > > -		 * NULL. Just try again on the new CPU that we ended up on.
> > > > +		 * If the CPU onlining code successfully allocates acomp_ctx resources,
> > > > +		 * it sets acomp_ctx->__online to true. Until this happens, we have
> > > > +		 * two options:
> > > > +		 *
> > > > +		 * 1. Return NULL and fail all stores on this CPU.
> > > > +		 * 2. Retry, until onlining has finished allocating resources.
> > > > +		 *
> > > > +		 * In theory, option 1 could be more appropriate, because it
> > > > +		 * allows the calling procedure to decide how it wants to handle
> > > > +		 * reclaim racing with CPU hotplug. For instance, it might be Ok
> > > > +		 * for compress to return an error for the backing swap device
> > > > +		 * to store the folio. Decompress could wait until we get a
> > > > +		 * valid and locked mutex after onlining has completed. For now,
> > > > +		 * we go with option 2 because adding a do-while in
> > > > +		 * zswap_decompress() adds latency for software compressors.
> > > > +		 *
> > > > +		 * Once initialized, the resources will be de-allocated only
> > > > +		 * when the pool is destroyed. The acomp_ctx will hold on to the
> > > > +		 * resources through CPU offlining/onlining at any time until
> > > > +		 * the pool is destroyed.
> > > > +		 *
> > > > +		 * This prevents races/deadlocks between reclaim and CPU acomp_ctx
> > > > +		 * resource allocation that are a dependency for reclaim.
> > > > +		 * It further simplifies the interaction with CPU onlining and
> > > > +		 * offlining:
> > > > +		 *
> > > > +		 * - CPU onlining does not take the mutex. It only allocates
> > > > +		 *   resources and sets __online to true.
> > > > +		 * - CPU offlining acquires the mutex before setting
> > > > +		 *   __online to false. If reclaim has acquired the mutex,
> > > > +		 *   offlining will have to wait for reclaim to complete before
> > > > +		 *   hotunplug can proceed. Further, hotunplug merely sets
> > > > +		 *   __online to false. It does not delete the acomp_ctx
> > > > +		 *   resources.
> > > > +		 *
> > > > +		 * Option 1 is better than potentially not exiting the earlier
> > > > +		 * for (;;) loop because the system is running low on memory
> > > > +		 * and/or CPUs are getting offlined for whatever reason. At
> > > > +		 * least failing this store will prevent data loss by failing
> > > > +		 * zswap_store(), and saving the data in the backing swap device.
> > > >  		 */
> > >
> > > I believe this can be dropped. I don't think we can have any store/load
> > > operations on a CPU before it's fully onlined, and we should always have
> > > a reference on the pool here, so the resources cannot go away.
> > >
> > > So unless I missed something we can drop this completely now and just
> > > hold the mutex directly in the load/store paths.
> >
> > Based on the above explanations, please let me know if it is a good idea
> > to keep the __online, or if you think further simplification is possible.
> 
> I still think it's not needed. Let me know if I missed anything.

Let me think some more about whether it is feasible to not have cpuhotplug
manage the acomp_ctx resource allocation, and instead have this be done
through the pool creation/deletion routines.

Thanks,
Kanchana