> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@xxxxxxxxx>
> Sent: Tuesday, March 18, 2025 12:06 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx;
> nphamcs@xxxxxxxxx; chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx;
> ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx; ying.huang@xxxxxxxxxxxxxxxxx;
> akpm@xxxxxxxxxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx;
> herbert@xxxxxxxxxxxxxxxxxxx; davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx;
> ardb@xxxxxxxxxx; ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx;
> Accardi, Kristen C <kristen.c.accardi@xxxxxxxxx>;
> Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource
> allocation/deletion and mutex lock usage.
>
> On Tue, Mar 18, 2025 at 05:38:49PM +0000, Sridhar, Kanchana P wrote:
> [..]
> > > > Thanks Yosry, for this observation! You are right, when considered
> > > > purely from a CPU hotplug perspective, zswap_cpu_comp_prepare() and
> > > > zswap_cpu_comp_dead() in fact run on a control CPU, because the state
> > > > is registered in the PREPARE section of "enum cpuhp_state" in
> > > > cpuhotplug.h.
> > > >
> > > > The problem however is that, in the current architecture, CPU onlining/
> > > > zswap_pool creation, and CPU offlining/zswap_pool deletion have the
> > > > same semantics as far as these resources are concerned. Hence, although
> > > > zswap_cpu_comp_prepare() is run on a control CPU, the CPU for which
> > > > the "hotplug" code is called is in fact online. It is possible for the
> > > > memory allocation calls in zswap_cpu_comp_prepare() to recurse into
> > > > zswap_compress(), which now needs to be handled by the current pool,
> > > > since the new pool has not yet been added to the zswap_pools, as you
> > > > pointed out.
> > > >
> > > > The ref on the current pool has not yet been dropped. Could there be
> > > > a potential for a deadlock at pool transition time: the new pool is
> > > > blocked from allocating acomp_ctx resources, triggering reclaim, which
> > > > the old pool needs to handle?
> > >
> > > I am not sure how this could lead to a deadlock. The compression will be
> > > happening in a different pool with a different acomp_ctx.
> >
> > I was thinking about this from the perspective of comparing the trade-offs
> > between these two approaches:
> > a) Allocating acomp_ctx resources for a pool when a CPU is functional, vs.
> > b) Allocating acomp_ctx resources once upfront.
> >
> > With (a), when the user switches zswap to use a new compressor, it is
> > possible that the system is already in a low memory situation and the CPU
> > could be doing a lot of swapouts. It occurred to me that in theory, the
> > call to switch the compressor through the sysfs interface could never
> > return if the acomp_ctx allocations trigger direct reclaim in this
> > scenario. This was in the context of exploring if a better design is
> > possible, while acknowledging that this could still happen today.
>
> If the system is already in a low memory situation a lot of things will
> hang. Switching the compressor is not a common operation at all and we
> shouldn't really worry about that. Even if we remove the acomp_ctx
> allocation, we still need to make some allocations in that path anyway.

Ok, these are good points.
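For completeness, the flow we have been discussing is roughly the following
(paraphrasing mm/zswap.c from memory, so the exact code may differ):

	/*
	 * In zswap_setup(): the callbacks are registered in the PREPARE
	 * section of "enum cpuhp_state", so they run on a control CPU:
	 */
	ret = cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE,
				      "mm/zswap_pool:prepare",
				      zswap_cpu_comp_prepare,
				      zswap_cpu_comp_dead);

	/*
	 * In zswap_pool_create(): this ends up calling
	 * zswap_cpu_comp_prepare() for every online CPU. The req/buffer
	 * allocations made there can trigger reclaim, which is still
	 * serviced by the old pool because the new pool has not been
	 * added to zswap_pools yet:
	 */
	ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);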
> >
> > With (b), this situation is avoided by design, and we can switch to a new
> > pool without triggering additional reclaim. Sorry, I should have
> > articulated this better.
>
> But we have to either allocate more memory unnecessarily or add config
> options and make batching a build-time decision. This is unwarranted
> imo.
>
> FWIW, the mutexes and buffers used to be per-CPU not per-acomp_ctx, but
> they were changed in commit 8ba2f844f050 ("mm/zswap: change per-cpu
> mutex and buffer to per-acomp_ctx"). What you're suggesting is not quite
> the same as what we had before that commit, it's moving the acomp_ctx
> itself to be per-CPU but not per-pool, including the mutex and buffer.
> But I thought the context may be useful.

Thanks for sharing the context.
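To make what I was suggesting a bit more concrete, I was picturing something
along these lines (purely illustrative; the names and layout are made up and
not from any existing patch):

	/*
	 * Illustrative sketch only: one acomp_ctx per CPU, owned by zswap
	 * itself rather than by a zswap_pool, with the reqs/buffers allocated
	 * once in zswap_setup() and reused across compressor switches.
	 * ZSWAP_MAX_BATCH_SIZE reqs/buffers would allow both batching and
	 * non-batching compressors to be handled with the same structure.
	 */
	struct zswap_acomp_ctx {
		struct crypto_acomp *acomp;
		struct acomp_req *reqs[ZSWAP_MAX_BATCH_SIZE];
		u8 *buffers[ZSWAP_MAX_BATCH_SIZE];
		struct crypto_wait wait;
		struct mutex mutex;
	};

	static DEFINE_PER_CPU(struct zswap_acomp_ctx, zswap_acomp_ctx);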
>
> [..]
> > > > 7) To address the issue of how many reqs/buffers to allocate, there
> > > > could potentially be a memory cost for non-batching compressors, if we
> > > > want to always allocate ZSWAP_MAX_BATCH_SIZE acomp_reqs and buffers.
> > > > This would allow the acomp_ctx to seamlessly handle batching
> > > > compressors, non-batching compressors, and transitions among the
> > > > two compressor types in a pretty general manner, that relies only on
> > > > the ZSWAP_MAX_BATCH_SIZE, which we define anyway.
> > > >
> > > > I believe we can maximize the chances of success for the allocation of
> > > > the acomp_ctx resources if this is done in zswap_setup(), but please
> > > > correct me if I am wrong.
> > > >
> > > > The added memory cost for platforms without IAA would be
> > > > ~57 KB per cpu, on x86. Would this be acceptable?
> > >
> > > I think that's a lot of memory to waste per-CPU, and I don't see a good
> > > reason for it.
> >
> > Yes, it appears so. Towards trying to see if a better design is possible
> > by de-coupling the acomp_ctx from zswap_pool:
> > Would this cost be acceptable if it is incurred based on a build config
> > option, saying CONFIG_ALLOC_ZSWAP_BATCHING_RESOURCES (default OFF)?
> > If this is set, we go ahead and allocate ZSWAP_MAX_BATCH_SIZE
> > acomp_ctx resources once, during zswap_setup(). If not, we allocate
> > only one req/buffer in the global percpu acomp_ctx?
>
> We should avoid making batching a build time decision if we can help it.
> A lot of kernels are shipped to different hardware that may or may not
> support batching, so users will have to either decide to turn off
> batching completely or eat the overhead even for hardware that does not
> support batching (or for users that use SW compression).
>
> [..]
> > > >
> > > > The only other fallback solution in lieu of the proposed simplification
> > > > that I can think of is to keep the lifespan of these resources from
> > > > pool creation to deletion, using the CPU hotplug code. Although, it is
> > > > not totally clear to me if there is potential for deadlock during pool
> > > > transitions, as noted above.
> > >
> > > I am not sure what's the deadlock scenario you're worried about, please
> > > clarify.
> >
> > My apologies: I was referring to triggering direct reclaim during pool
> > creation, which could, in theory, run into a scenario where the pool
> > switching would have to wait for reclaim to free up enough memory for the
> > acomp_ctx resources allocation: this could cause the system to hang, but
> > not a deadlock. This can happen even today, hence trying to see if a
> > better design is possible.
>
> I think the concern here is unfounded. We shouldn't care about the
> performance of zswap compressor switching, especially under memory
> pressure. A lot of things will slow down under memory pressure, and
> compressor switching should be the least of our concerns.

Sounds good. It then appears that making the per-CPU acomp_ctx resources'
lifetime track that of the zswap_pool is the way to go. These resources will
be allocated as per the requirements of the compressor (i.e., the zswap_pool),
and will persist across CPU online/offline transitions via the hotplug
interface.

If this plan is acceptable, it appears that acomp_ctx_get_cpu_lock() and
acomp_ctx_put_unlock() can be replaced with plain mutex_lock()/unlock() in
zswap_[de]compress(), roughly along the lines of the sketch below my
signature? I will incorporate these changes in v9 if this sounds OK.

Thanks,
Kanchana
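P.S.: For illustration only, a rough sketch of the zswap_compress() locking I
have in mind, assuming the existing struct crypto_acomp_ctx and the per-CPU
zswap_pool->acomp_ctx stay as they are today (the compression logic itself is
elided; the actual code will be in v9):

	static bool zswap_compress(struct page *page, struct zswap_entry *entry,
				   struct zswap_pool *pool)
	{
		struct crypto_acomp_ctx *acomp_ctx;
		bool ret;

		/*
		 * The per-CPU resources are valid for the lifetime of the
		 * pool, and the mutex keeps things safe even if we migrate
		 * and end up using another CPU's acomp_ctx, so no explicit
		 * get/put is needed.
		 */
		acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
		mutex_lock(&acomp_ctx->mutex);

		/* ... existing compression using acomp_ctx->req/buffer ... */
		ret = true;

		mutex_unlock(&acomp_ctx->mutex);
		return ret;
	}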