> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@xxxxxxxxx>
> Sent: Tuesday, March 18, 2025 7:24 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx;
> nphamcs@xxxxxxxxx; chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx;
> ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx; ying.huang@xxxxxxxxxxxxxxxxx;
> akpm@xxxxxxxxxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx;
> herbert@xxxxxxxxxxxxxxxxxxx; davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx;
> ardb@xxxxxxxxxx; ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi, Kristen C
> <kristen.c.accardi@xxxxxxxxx>; Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource
> allocation/deletion and mutex lock usage.
>
> On Mon, Mar 17, 2025 at 09:15:09PM +0000, Sridhar, Kanchana P wrote:
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosry.ahmed@xxxxxxxxx>
> > > Sent: Monday, March 10, 2025 10:31 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> > > Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx;
> > > nphamcs@xxxxxxxxx; chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx;
> > > ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx; ying.huang@xxxxxxxxxxxxxxxxx;
> > > akpm@xxxxxxxxxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx;
> > > herbert@xxxxxxxxxxxxxxxxxxx; davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx;
> > > ardb@xxxxxxxxxx; ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi, Kristen C
> > > <kristen.c.accardi@xxxxxxxxx>; Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> > > Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> > > Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource
> > > allocation/deletion and mutex lock usage.
> > >
> > > On Sat, Mar 08, 2025 at 02:47:15AM +0000, Sridhar, Kanchana P wrote:
> > >
> > > > [..]
> > > > > > > >  	u8 *buffer;
> > > > > > > > +	u8 nr_reqs;
> > > > > > > > +	struct crypto_wait wait;
> > > > > > > >  	struct mutex mutex;
> > > > > > > >  	bool is_sleepable;
> > > > > > > > +	bool __online;
> > > > > > >
> > > > > > > I don't believe we need this.
> > > > > > >
> > > > > > > If we are not freeing resources during CPU offlining, then we do not
> > > > > > > need a CPU offline callback and acomp_ctx->__online serves no purpose.
> > > > > > >
> > > > > > > The whole point of synchronizing between offlining and
> > > > > > > compress/decompress operations is to avoid UAF. If offlining does not
> > > > > > > free resources, then we can hold the mutex directly in the
> > > > > > > compress/decompress path and drop the hotunplug callback completely.
> > > > > > >
> > > > > > > I also believe nr_reqs can be dropped from this patch, as it seems like
> > > > > > > it's only used to know when to set __online.
> > > > > >
> > > > > > All great points! In fact, that was the original solution I had implemented
> > > > > > (not having an offline callback). But then, I spent some time understanding
> > > > > > the v6.13 hotfix for synchronizing freeing of resources, and this comment
> > > > > > in zswap_cpu_comp_prepare():
> > > > > >
> > > > > > /*
> > > > > >  * Only hold the mutex after completing allocations, otherwise we may
> > > > > >  * recurse into zswap through reclaim and attempt to hold the mutex
> > > > > >  * again resulting in a deadlock.
> > > > > >  */
> > > > > >
> > > > > > Hence, I figured the constraint of "recurse into zswap through reclaim" was
> > > > > > something to comprehend in the simplification (even though I had a tough
> > > > > > time imagining how this could happen).
> > > > >
> > > > > The constraint here is about zswap_cpu_comp_prepare() holding the mutex,
> > > > > making an allocation which internally triggers reclaim, then recursing
> > > > > into zswap and trying to hold the same mutex again causing a deadlock.
> > > > >
> > > > > If zswap_cpu_comp_prepare() does not need to hold the mutex to begin
> > > > > with, the constraint naturally goes away.
> > > >
> > > > Actually, if it is possible for the allocations in zswap_cpu_comp_prepare()
> > > > to trigger reclaim, then I believe we need some way for reclaim to know if
> > > > the acomp_ctx resources are available. Hence, this seems like a potential
> > > > for deadlock regardless of the mutex.
> > >
> > > I took a closer look and I believe my hotfix was actually unnecessary. I
> > > sent it out in response to a syzbot report, but upon closer look it
> > > seems like it was not an actual problem. Sorry if my patch confused you.
> > >
> > > Looking at enum cpuhp_state in include/linux/cpuhotplug.h, it seems like
> > > CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section. The comment above
> > > says:
> > >
> > > * PREPARE: The callbacks are invoked on a control CPU before the
> > > * hotplugged CPU is started up or after the hotplugged CPU has died.
> > >
> > > So even if we go into reclaim during zswap_cpu_comp_prepare(), it will
> > > never be on the CPU that we are allocating resources for.
> > >
> > > The other case where zswap_cpu_comp_prepare() could race with
> > > compression/decompression is when a pool is being created. In this case,
> > > reclaim from zswap_cpu_comp_prepare() can recurse into zswap on the same
> > > CPU AFAICT. However, because the pool is still under creation, it will
> > > not be used (i.e. zswap_pool_current_get() won't find it).
> > >
> > > So I think we don't need to worry about zswap_cpu_comp_prepare() racing
> > > with compression or decompression for the same pool and CPU.
> >
> > Thanks Yosry, for this observation! You are right, when considered purely
> > from a CPU hotplug perspective, zswap_cpu_comp_prepare() and
> > zswap_cpu_comp_dead() in fact run on a control CPU, because the state is
> > registered in the PREPARE section of "enum cpuhp_state" in cpuhotplug.h.
> >
> > The problem however is that, in the current architecture, CPU onlining/
> > zswap_pool creation and CPU offlining/zswap_pool deletion have the
> > same semantics as far as these resources are concerned. Hence, although
> > zswap_cpu_comp_prepare() is run on a control CPU, the CPU for which
> > the "hotplug" code is called is in fact online. It is possible for the memory
> > allocation calls in zswap_cpu_comp_prepare() to recurse into
> > zswap_compress(), which now needs to be handled by the current pool,
> > since the new pool has not yet been added to zswap_pools, as you
> > pointed out.
> >
> > The ref on the current pool has not yet been dropped. Could there be
> > a potential for a deadlock at pool transition time: the new pool is blocked
> > from allocating acomp_ctx resources, triggering reclaim, which the old
> > pool needs to handle?
>
> I am not sure how this could lead to a deadlock. The compression will be
> happening in a different pool with a different acomp_ctx.

I was thinking about this from the perspective of comparing the trade-offs
between these two approaches:

a) Allocating acomp_ctx resources for a pool when a CPU is functional, vs.
b) Allocating acomp_ctx resources once, upfront.

With (a), when the user switches zswap to a new compressor, it is possible
that the system is already in a low-memory situation and the CPU could be
doing a lot of swapouts. It occurred to me that, in theory, the call to
switch the compressor through the sysfs interface could never return if the
acomp_ctx allocations trigger direct reclaim in this scenario. This was in
the context of exploring whether a better design is possible, while
acknowledging that this can already happen today. With (b), this situation
is avoided by design, and we can switch to a new pool without triggering
additional reclaim. Sorry, I should have articulated this better.

> > I see other places in the kernel that use CPU hotplug for resource allocation,
> > outside of the context of CPU onlining. IIUC, it is difficult to guarantee that
> > the startup/teardown callbacks are modifying acomp_ctx resources for a
> > dysfunctional CPU.
>
> IIUC, outside the context of CPU onlining, CPU hotplug callbacks get
> called when they are added. In this case, only the newly added callbacks
> will be executed. IOW, zswap's hotplug callback should not be randomly
> getting called when irrelevant code adds hotplug callbacks. It should
> only happen during zswap pool initialization or CPU onlining.
>
> Please correct me if I am wrong.

Yes, this is my understanding as well.

> > Now that I think about it, the only real constraint is that the acomp_ctx
> > resources are guaranteed to exist for a functional CPU which can run zswap
> > compress/decompress.
>
> I believe this is already the case as I previously described, because
> the hotplug callback can only be called in two scenarios:
> - Zswap pool initialization, in which case compress/decompress
>   operations cannot run on the pool we are initializing.
> - CPU onlining, in which case compress/decompress operations cannot run
>   on the CPU we are onlining.
>
> Please correct me if I am wrong.

Agreed, this is my understanding too.

> > I think we can simplify this as follows, and would welcome suggestions
> > to improve the proposed solution:
> >
> > 1) We disassociate the acomp_ctx from the pool and instead have it be
> > a global percpu zswap resource that gets allocated once in zswap_setup(),
> > just like the zswap_entry_cache.
> > 2) The acomp_ctx resources get allocated during zswap_setup(), using
> > cpuhp_setup_state_multi() in zswap_setup(), which registers
> > zswap_cpu_comp_prepare() but no teardown callback.
> > 3) We call cpuhp_state_add_instance() for_each_possible_cpu(cpu) in
> > zswap_setup().
> > 4) The acomp_ctx resources persist through subsequent "real" CPU
> > offline/online state transitions.
> > 5) zswap_[de]compress() can go ahead and lock the mutex, and use the
> > reqs/buffers without worrying about whether these resources are
> > initialized or still exist/are being deleted.
> > 6) "struct zswap_pool" is now decoupled from this global percpu zswap
> > acomp_ctx.
> > 7) To address the issue of how many reqs/buffers to allocate, there could
> > potentially be a memory cost for non-batching compressors, if we want
> > to always allocate ZSWAP_MAX_BATCH_SIZE acomp_reqs and buffers.
> > This would allow the acomp_ctx to seamlessly handle batching compressors,
> > non-batching compressors, and transitions between the two compressor
> > types in a fairly general manner that relies only on ZSWAP_MAX_BATCH_SIZE,
> > which we define anyway.
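
To make the proposal a bit more concrete, here is a rough, untested sketch
of the direction I have in mind for (1), (2), (4) and (5). Names, struct
fields and error handling are illustrative only, and the compressor-specific
tfm/req setup is elided:

/* Global percpu resources, decoupled from struct zswap_pool. */
struct zswap_acomp_ctx {
	u8 *buffers[ZSWAP_MAX_BATCH_SIZE];
	struct mutex mutex;
	/* reqs, acomp tfm, crypto_wait, etc. elided */
};

static struct zswap_acomp_ctx __percpu *zswap_acomp_ctx;

/*
 * Startup callback only; with no teardown callback registered, the
 * per-CPU resources persist across real CPU offline/online transitions.
 */
static int zswap_cpu_comp_prepare(unsigned int cpu)
{
	struct zswap_acomp_ctx *ctx = per_cpu_ptr(zswap_acomp_ctx, cpu);
	int i;

	mutex_init(&ctx->mutex);

	for (i = 0; i < ZSWAP_MAX_BATCH_SIZE; i++) {
		ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
					       cpu_to_node(cpu));
		if (!ctx->buffers[i])
			return -ENOMEM;
	}

	return 0;
}

static int __init zswap_setup(void)
{
	int ret;

	...
	zswap_acomp_ctx = alloc_percpu(struct zswap_acomp_ctx);
	if (!zswap_acomp_ctx)
		goto fail;

	/*
	 * Single-instance variant shown for simplicity; the
	 * cpuhp_setup_state_multi()/cpuhp_state_add_instance() flavor
	 * from (2)/(3) would work the same way.
	 */
	ret = cpuhp_setup_state(CPUHP_MM_ZSWP_POOL_PREPARE,
				"mm/zswap:prepare",
				zswap_cpu_comp_prepare, NULL);
	if (ret)
		goto fail;
	...
}

With this, zswap_compress()/zswap_decompress() would simply do:

	acomp_ctx = raw_cpu_ptr(zswap_acomp_ctx);
	mutex_lock(&acomp_ctx->mutex);
	/* use acomp_ctx->buffers[] / reqs */
	mutex_unlock(&acomp_ctx->mutex);

without any of the existence checks in acomp_ctx_get_cpu_lock().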

> > I believe we can maximize the chances of success for the allocation of
> > the acomp_ctx resources if this is done in zswap_setup(), but please
> > correct me if I am wrong.
> >
> > The added memory cost for platforms without IAA would be
> > ~57 KB per CPU on x86. Would this be acceptable?
>
> I think that's a lot of memory to waste per-CPU, and I don't see a good
> reason for it.

Yes, it appears so. In the interest of seeing whether a better design is
possible by decoupling the acomp_ctx from zswap_pool: would this cost be
acceptable if it were incurred only under a build config option, say
CONFIG_ALLOC_ZSWAP_BATCHING_RESOURCES (default OFF)? If this is set, we
allocate ZSWAP_MAX_BATCH_SIZE acomp_ctx resources once, during
zswap_setup(). If not, we allocate only one req/buffer in the global
percpu acomp_ctx.

Since other compressors may want to do batching in the future, I thought
a more general config option name would be appropriate.

> > If not, I don't believe this simplification would be worth it, because
> > allocating for one req/buffer, then dynamically adding more resources
> > if a newly selected compressor requires more resources, would run
> > into the same race conditions and added checks as in
> > acomp_ctx_get_cpu_lock(), which, I believe, seem to be necessary because
> > CPU onlining/zswap_pool creation and CPU offlining/zswap_pool
> > deletion have the same semantics for these resources.
>
> Agree that using a single acomp_ctx per-CPU but making the resources
> resizable is not a win.

Yes, this makes sense: resizing is not the way to go.

> > The only other fallback solution in lieu of the proposed simplification that
> > I can think of is to keep the lifespan of these resources from pool creation
> > to deletion, using the CPU hotplug code. Although, it is not totally clear
> > to me if there is potential for deadlock during pool transitions, as noted
> > above.
>
> I am not sure what's the deadlock scenario you're worried about, please
> clarify.

My apologies: I was referring to triggering direct reclaim during pool
creation, which could, in theory, run into a scenario where pool switching
has to wait for reclaim to free up enough memory for the acomp_ctx
allocations. This could cause the system to hang, but it is not a deadlock.
This can happen even today, hence my interest in seeing whether a better
design is possible.

Thanks,
Kanchana