> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@xxxxxxxxx>
> Sent: Tuesday, March 18, 2025 12:06 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx;
> nphamcs@xxxxxxxxx; chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx;
> ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx; ying.huang@xxxxxxxxxxxxxxxxx;
> akpm@xxxxxxxxxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx;
> herbert@xxxxxxxxxxxxxxxxxxx; davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx;
> ardb@xxxxxxxxxx; ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx;
> Accardi, Kristen C <kristen.c.accardi@xxxxxxxxx>;
> Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource
> allocation/deletion and mutex lock usage.
>
> On Tue, Mar 18, 2025 at 05:38:49PM +0000, Sridhar, Kanchana P wrote:
> [..]
> > > > Thanks Yosry, for this observation! You are right, when considered
> > > > purely from a CPU hotplug perspective, zswap_cpu_comp_prepare() and
> > > > zswap_cpu_comp_dead() in fact run on a control CPU, because the state
> > > > is registered in the PREPARE section of "enum cpuhp_state" in
> > > > cpuhotplug.h.
> > > >
> > > > The problem however is that, in the current architecture, CPU onlining/
> > > > zswap_pool creation, and CPU offlining/zswap_pool deletion have the
> > > > same semantics as far as these resources are concerned. Hence, although
> > > > zswap_cpu_comp_prepare() is run on a control CPU, the CPU for which
> > > > the "hotplug" code is called is in fact online. It is possible for the
> > > > memory allocation calls in zswap_cpu_comp_prepare() to recurse into
> > > > zswap_compress(), which now needs to be handled by the current pool,
> > > > since the new pool has not yet been added to the zswap_pools, as you
> > > > pointed out.
> > > >
> > > > The ref on the current pool has not yet been dropped. Could there be
> > > > a potential for a deadlock at pool transition time: the new pool is
> > > > blocked from allocating acomp_ctx resources, triggering reclaim, which
> > > > the old pool needs to handle?
> > >
> > > I am not sure how this could lead to a deadlock. The compression will be
> > > happening in a different pool with a different acomp_ctx.
> >
> > I was thinking about this from the perspective of comparing the trade-offs
> > between these two approaches:
> > a) Allocating acomp_ctx resources for a pool when a CPU is functional, vs.
> > b) Allocating acomp_ctx resources once upfront.
> >
> > With (a), when the user switches zswap to use a new compressor, it is
> > possible that the system is already in a low memory situation and the CPU
> > could be doing a lot of swapouts. It occurred to me that in theory, the
> > call to switch the compressor through the sysfs interface could never
> > return if the acomp_ctx allocations trigger direct reclaim in this
> > scenario. This was in the context of exploring if a better design is
> > possible, while acknowledging that this could still happen today.
>
> If the system is already in a low memory situation a lot of things will
> hang. Switching the compressor is not a common operation at all and we
> shouldn't really worry about that. Even if we remove the acomp_ctx
> allocation, we still need to make some allocations in that path anyway.

Ok, these are good points.
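For completeness, the flow we have been discussing is roughly the following
(paraphrasing mm/zswap.c from memory, so the exact code may differ):

	/*
	 * In zswap_setup(): the callbacks are registered in the PREPARE
	 * section of "enum cpuhp_state", so they run on a control CPU:
	 */
	ret = cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE,
				      "mm/zswap_pool:prepare",
				      zswap_cpu_comp_prepare,
				      zswap_cpu_comp_dead);

	/*
	 * In zswap_pool_create(): this ends up calling
	 * zswap_cpu_comp_prepare() for every online CPU. The req/buffer
	 * allocations made there can trigger reclaim, which is still
	 * serviced by the old pool because the new pool has not been
	 * added to zswap_pools yet:
	 */
	ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);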
> >
> > With (b), this situation is avoided by design, and we can switch to a new
> > pool without triggering additional reclaim. Sorry, I should have
> > articulated this better.
>
> But we have to either allocate more memory unnecessarily or add config
> options and make batching a build-time decision. This is unwarranted
> imo.
>
> FWIW, the mutexes and buffers used to be per-CPU not per-acomp_ctx, but
> they were changed in commit 8ba2f844f050 ("mm/zswap: change per-cpu
> mutex and buffer to per-acomp_ctx"). What you're suggesting is not quite
> the same as what we had before that commit, it's moving the acomp_ctx
> itself to be per-CPU but not per-pool, including the mutex and buffer.
> But I thought the context may be useful.

Thanks for sharing the context.
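To make what I was suggesting a bit more concrete, I was picturing something
along these lines (purely illustrative; the names and layout are made up and
not from any existing patch):

	/*
	 * Illustrative sketch only: one acomp_ctx per CPU, owned by zswap
	 * itself rather than by a zswap_pool, with the reqs/buffers allocated
	 * once in zswap_setup() and reused across compressor switches.
	 * ZSWAP_MAX_BATCH_SIZE reqs/buffers would allow both batching and
	 * non-batching compressors to be handled with the same structure.
	 */
	struct zswap_acomp_ctx {
		struct crypto_acomp *acomp;
		struct acomp_req *reqs[ZSWAP_MAX_BATCH_SIZE];
		u8 *buffers[ZSWAP_MAX_BATCH_SIZE];
		struct crypto_wait wait;
		struct mutex mutex;
	};

	static DEFINE_PER_CPU(struct zswap_acomp_ctx, zswap_acomp_ctx);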
>
> [..]
> > > > 7) To address the issue of how many reqs/buffers to allocate, there
> > > > could potentially be a memory cost for non-batching compressors, if we
> > > > want to always allocate ZSWAP_MAX_BATCH_SIZE acomp_reqs and buffers.
> > > > This would allow the acomp_ctx to seamlessly handle batching
> > > > compressors, non-batching compressors, and transitions among the
> > > > two compressor types in a pretty general manner, that relies only on
> > > > the ZSWAP_MAX_BATCH_SIZE, which we define anyway.
> > > >
> > > > I believe we can maximize the chances of success for the allocation of
> > > > the acomp_ctx resources if this is done in zswap_setup(), but please
> > > > correct me if I am wrong.
> > > >
> > > > The added memory cost for platforms without IAA would be
> > > > ~57 KB per cpu, on x86. Would this be acceptable?
> > >
> > > I think that's a lot of memory to waste per-CPU, and I don't see a good
> > > reason for it.
> >
> > Yes, it appears so. Towards trying to see if a better design is possible
> > by de-coupling the acomp_ctx from zswap_pool:
> > Would this cost be acceptable if it is incurred based on a build config
> > option, saying CONFIG_ALLOC_ZSWAP_BATCHING_RESOURCES (default OFF)?
> > If this is set, we go ahead and allocate ZSWAP_MAX_BATCH_SIZE
> > acomp_ctx resources once, during zswap_setup(). If not, we allocate
> > only one req/buffer in the global percpu acomp_ctx?
>
> We should avoid making batching a build time decision if we can help it.
> A lot of kernels are shipped to different hardware that may or may not
> support batching, so users will have to either decide to turn off
> batching completely or eat the overhead even for hardware that does not
> support batching (or for users that use SW compression).
>
> [..]
> > > >
> > > > The only other fallback solution in lieu of the proposed simplification
> > > > that I can think of is to keep the lifespan of these resources from
> > > > pool creation to deletion, using the CPU hotplug code. Although, it is
> > > > not totally clear to me if there is potential for deadlock during pool
> > > > transitions, as noted above.
> > >
> > > I am not sure what's the deadlock scenario you're worried about, please
> > > clarify.
> >
> > My apologies: I was referring to triggering direct reclaim during pool
> > creation, which could, in theory, run into a scenario where the pool
> > switching would have to wait for reclaim to free up enough memory for the
> > acomp_ctx resources allocation: this could cause the system to hang, but
> > not a deadlock. This can happen even today, hence trying to see if a
> > better design is possible.
>
> I think the concern here is unfounded. We shouldn't care about the
> performance of zswap compressor switching, especially under memory
> pressure. A lot of things will slow down under memory pressure, and
> compressor switching should be the least of our concerns.

Sounds good. It then appears that making the per-CPU acomp_ctx resources'
lifetime track that of the zswap_pool is the way to go. These resources will
be allocated as per the requirements of the compressor (i.e., the zswap_pool),
and will persist across CPU online/offline transitions via the hotplug
interface.

If this plan is acceptable, it appears that acomp_ctx_get_cpu_lock() and
acomp_ctx_put_unlock() can be replaced with plain mutex_lock()/unlock() in
zswap_[de]compress(), roughly along the lines of the sketch below my
signature? I will incorporate these changes in v9 if this sounds OK.

Thanks,
Kanchana
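P.S.: For illustration only, a rough sketch of the zswap_compress() locking I
have in mind, assuming the existing struct crypto_acomp_ctx and the per-CPU
zswap_pool->acomp_ctx stay as they are today (the compression logic itself is
elided; the actual code will be in v9):

	static bool zswap_compress(struct page *page, struct zswap_entry *entry,
				   struct zswap_pool *pool)
	{
		struct crypto_acomp_ctx *acomp_ctx;
		bool ret;

		/*
		 * The per-CPU resources are valid for the lifetime of the
		 * pool, and the mutex keeps things safe even if we migrate
		 * and end up using another CPU's acomp_ctx, so no explicit
		 * get/put is needed.
		 */
		acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
		mutex_lock(&acomp_ctx->mutex);

		/* ... existing compression using acomp_ctx->req/buffer ... */
		ret = true;

		mutex_unlock(&acomp_ctx->mutex);
		return ret;
	}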