> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@xxxxxxxxx>
> Sent: Tuesday, March 18, 2025 7:24 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx;
> nphamcs@xxxxxxxxx; chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx;
> ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx; ying.huang@xxxxxxxxxxxxxxxxx;
> akpm@xxxxxxxxxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx;
> herbert@xxxxxxxxxxxxxxxxxxx; davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx;
> ardb@xxxxxxxxxx; ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi, Kristen C
> <kristen.c.accardi@xxxxxxxxx>; Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource
> allocation/deletion and mutex lock usage.
>
> On Mon, Mar 17, 2025 at 09:15:09PM +0000, Sridhar, Kanchana P wrote:
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosry.ahmed@xxxxxxxxx>
> > > Sent: Monday, March 10, 2025 10:31 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> > > Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx;
> > > nphamcs@xxxxxxxxx; chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx;
> > > ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx; ying.huang@xxxxxxxxxxxxxxxxx;
> > > akpm@xxxxxxxxxxxxxxxxxxxx; linux-crypto@xxxxxxxxxxxxxxx;
> > > herbert@xxxxxxxxxxxxxxxxxxx; davem@xxxxxxxxxxxxx; clabbe@xxxxxxxxxxxx;
> > > ardb@xxxxxxxxxx; ebiggers@xxxxxxxxxx; surenb@xxxxxxxxxx; Accardi, Kristen C
> > > <kristen.c.accardi@xxxxxxxxx>; Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> > > Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> > > Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource
> > > allocation/deletion and mutex lock usage.
> > >
> > > On Sat, Mar 08, 2025 at 02:47:15AM +0000, Sridhar, Kanchana P wrote:
> > >
> > > > [..]
> > > > > > > >  	u8 *buffer;
> > > > > > > > +	u8 nr_reqs;
> > > > > > > > +	struct crypto_wait wait;
> > > > > > > >  	struct mutex mutex;
> > > > > > > >  	bool is_sleepable;
> > > > > > > > +	bool __online;
> > > > > > >
> > > > > > > I don't believe we need this.
> > > > > > >
> > > > > > > If we are not freeing resources during CPU offlining, then we do not
> > > > > > > need a CPU offline callback and acomp_ctx->__online serves no purpose.
> > > > > > >
> > > > > > > The whole point of synchronizing between offlining and
> > > > > > > compress/decompress operations is to avoid UAF. If offlining does not
> > > > > > > free resources, then we can hold the mutex directly in the
> > > > > > > compress/decompress path and drop the hotunplug callback completely.
> > > > > > >
> > > > > > > I also believe nr_reqs can be dropped from this patch, as it seems like
> > > > > > > it's only used to know when to set __online.
> > > > > >
> > > > > > All great points! In fact, that was the original solution I had implemented
> > > > > > (not having an offline callback). But then, I spent some time understanding
> > > > > > the v6.13 hotfix for synchronizing freeing of resources, and this comment
> > > > > > in zswap_cpu_comp_prepare():
> > > > > >
> > > > > > /*
> > > > > >  * Only hold the mutex after completing allocations, otherwise we may
> > > > > >  * recurse into zswap through reclaim and attempt to hold the mutex
> > > > > >  * again resulting in a deadlock.
> > > > > >  */
> > > > > >
> > > > > > Hence, I figured the constraint of "recurse into zswap through reclaim" was
> > > > > > something to comprehend in the simplification (even though I had a tough
> > > > > > time imagining how this could happen).
> > > > >
> > > > > The constraint here is about zswap_cpu_comp_prepare() holding the mutex,
> > > > > making an allocation which internally triggers reclaim, then recursing
> > > > > into zswap and trying to hold the same mutex again causing a deadlock.
> > > > >
> > > > > If zswap_cpu_comp_prepare() does not need to hold the mutex to begin
> > > > > with, the constraint naturally goes away.
> > > >
> > > > Actually, if it is possible for the allocations in zswap_cpu_comp_prepare()
> > > > to trigger reclaim, then I believe we need some way for reclaim to know if
> > > > the acomp_ctx resources are available. Hence, this seems like a potential
> > > > for deadlock regardless of the mutex.
> > >
> > > I took a closer look and I believe my hotfix was actually unnecessary. I
> > > sent it out in response to a syzbot report, but upon closer look it
> > > seems like it was not an actual problem. Sorry if my patch confused you.
> > >
> > > Looking at enum cpuhp_state in include/linux/cpuhotplug.h, it seems like
> > > CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section. The comment above
> > > says:
> > >
> > > * PREPARE: The callbacks are invoked on a control CPU before the
> > > * hotplugged CPU is started up or after the hotplugged CPU has died.
> > >
> > > So even if we go into reclaim during zswap_cpu_comp_prepare(), it will
> > > never be on the CPU that we are allocating resources for.
> > >
> > > The other case where zswap_cpu_comp_prepare() could race with
> > > compression/decompression is when a pool is being created. In this case,
> > > reclaim from zswap_cpu_comp_prepare() can recurse into zswap on the same
> > > CPU AFAICT. However, because the pool is still under creation, it will
> > > not be used (i.e. zswap_pool_current_get() won't find it).
> > >
> > > So I think we don't need to worry about zswap_cpu_comp_prepare() racing
> > > with compression or decompression for the same pool and CPU.
> >
> > Thanks Yosry, for this observation! You are right, when considered purely
> > from a CPU hotplug perspective, zswap_cpu_comp_prepare() and
> > zswap_cpu_comp_dead() in fact run on a control CPU, because the state is
> > registered in the PREPARE section of "enum cpuhp_state" in cpuhotplug.h.
> >
> > The problem however is that, in the current architecture, CPU onlining/
> > zswap_pool creation and CPU offlining/zswap_pool deletion have the
> > same semantics as far as these resources are concerned. Hence, although
> > zswap_cpu_comp_prepare() is run on a control CPU, the CPU for which
> > the "hotplug" code is called is in fact online. It is possible for the memory
> > allocation calls in zswap_cpu_comp_prepare() to recurse into
> > zswap_compress(), which now needs to be handled by the current pool,
> > since the new pool has not yet been added to zswap_pools, as you
> > pointed out.
> >
> > The ref on the current pool has not yet been dropped. Could there be
> > a potential for a deadlock at pool transition time: the new pool is blocked
> > from allocating acomp_ctx resources, triggering reclaim, which the old
> > pool needs to handle?
>
> I am not sure how this could lead to a deadlock. The compression will be
> happening in a different pool with a different acomp_ctx.

I was thinking about this from the perspective of comparing the trade-offs
between these two approaches:

a) Allocating acomp_ctx resources for a pool when a CPU is functional, vs.
b) Allocating acomp_ctx resources once, upfront.

With (a), when the user switches zswap to a new compressor, it is possible
that the system is already in a low-memory situation and the CPU could be
doing a lot of swapouts. It occurred to me that, in theory, the call to
switch the compressor through the sysfs interface could never return if the
acomp_ctx allocations trigger direct reclaim in this scenario. This was in
the context of exploring whether a better design is possible, while
acknowledging that this can already happen today. With (b), this situation
is avoided by design, and we can switch to a new pool without triggering
additional reclaim. Sorry, I should have articulated this better.

> > I see other places in the kernel that use CPU hotplug for resource allocation,
> > outside of the context of CPU onlining. IIUC, it is difficult to guarantee that
> > the startup/teardown callbacks are modifying acomp_ctx resources for a
> > dysfunctional CPU.
>
> IIUC, outside the context of CPU onlining, CPU hotplug callbacks get
> called when they are added. In this case, only the newly added callbacks
> will be executed. IOW, zswap's hotplug callback should not be randomly
> getting called when irrelevant code adds hotplug callbacks. It should
> only happen during zswap pool initialization or CPU onlining.
>
> Please correct me if I am wrong.

Yes, this is my understanding as well.

> > Now that I think about it, the only real constraint is that the acomp_ctx
> > resources are guaranteed to exist for a functional CPU which can run zswap
> > compress/decompress.
>
> I believe this is already the case as I previously described, because
> the hotplug callback can only be called in two scenarios:
> - Zswap pool initialization, in which case compress/decompress
>   operations cannot run on the pool we are initializing.
> - CPU onlining, in which case compress/decompress operations cannot run
>   on the CPU we are onlining.
>
> Please correct me if I am wrong.

Agreed, this is my understanding too.

> > I think we can simplify this as follows, and would welcome suggestions
> > to improve the proposed solution:
> >
> > 1) We disassociate the acomp_ctx from the pool and instead have it be
> > a global percpu zswap resource that gets allocated once in zswap_setup(),
> > just like the zswap_entry_cache.
> > 2) The acomp_ctx resources get allocated during zswap_setup(), using
> > cpuhp_setup_state_multi() in zswap_setup(), which registers
> > zswap_cpu_comp_prepare() but no teardown callback.
> > 3) We call cpuhp_state_add_instance() for_each_possible_cpu(cpu) in
> > zswap_setup().
> > 4) The acomp_ctx resources persist through subsequent "real" CPU
> > offline/online state transitions.
> > 5) zswap_[de]compress() can go ahead and lock the mutex, and use the
> > reqs/buffers without worrying about whether these resources are
> > initialized or still exist/are being deleted.
> > 6) "struct zswap_pool" is now decoupled from this global percpu zswap
> > acomp_ctx.
> > 7) To address the issue of how many reqs/buffers to allocate, there could
> > potentially be a memory cost for non-batching compressors, if we want
> > to always allocate ZSWAP_MAX_BATCH_SIZE acomp_reqs and buffers.
> > This would allow the acomp_ctx to seamlessly handle batching compressors,
> > non-batching compressors, and transitions between the two compressor
> > types in a fairly general manner that relies only on ZSWAP_MAX_BATCH_SIZE,
> > which we define anyway.
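
To make the proposal a bit more concrete, here is a rough, untested sketch
of the direction I have in mind for (1), (2), (4) and (5). Names, struct
fields and error handling are illustrative only, and the compressor-specific
tfm/req setup is elided:

/* Global percpu resources, decoupled from struct zswap_pool. */
struct zswap_acomp_ctx {
	u8 *buffers[ZSWAP_MAX_BATCH_SIZE];
	struct mutex mutex;
	/* reqs, acomp tfm, crypto_wait, etc. elided */
};

static struct zswap_acomp_ctx __percpu *zswap_acomp_ctx;

/*
 * Startup callback only; with no teardown callback registered, the
 * per-CPU resources persist across real CPU offline/online transitions.
 */
static int zswap_cpu_comp_prepare(unsigned int cpu)
{
	struct zswap_acomp_ctx *ctx = per_cpu_ptr(zswap_acomp_ctx, cpu);
	int i;

	mutex_init(&ctx->mutex);

	for (i = 0; i < ZSWAP_MAX_BATCH_SIZE; i++) {
		ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
					       cpu_to_node(cpu));
		if (!ctx->buffers[i])
			return -ENOMEM;
	}

	return 0;
}

static int __init zswap_setup(void)
{
	int ret;

	...
	zswap_acomp_ctx = alloc_percpu(struct zswap_acomp_ctx);
	if (!zswap_acomp_ctx)
		goto fail;

	/*
	 * Single-instance variant shown for simplicity; the
	 * cpuhp_setup_state_multi()/cpuhp_state_add_instance() flavor
	 * from (2)/(3) would work the same way.
	 */
	ret = cpuhp_setup_state(CPUHP_MM_ZSWP_POOL_PREPARE,
				"mm/zswap:prepare",
				zswap_cpu_comp_prepare, NULL);
	if (ret)
		goto fail;
	...
}

With this, zswap_compress()/zswap_decompress() would simply do:

	acomp_ctx = raw_cpu_ptr(zswap_acomp_ctx);
	mutex_lock(&acomp_ctx->mutex);
	/* use acomp_ctx->buffers[] / reqs */
	mutex_unlock(&acomp_ctx->mutex);

without any of the existence checks in acomp_ctx_get_cpu_lock().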

> > I believe we can maximize the chances of success for the allocation of
> > the acomp_ctx resources if this is done in zswap_setup(), but please
> > correct me if I am wrong.
> >
> > The added memory cost for platforms without IAA would be
> > ~57 KB per CPU on x86. Would this be acceptable?
>
> I think that's a lot of memory to waste per-CPU, and I don't see a good
> reason for it.

Yes, it appears so. In the interest of seeing whether a better design is
possible by decoupling the acomp_ctx from zswap_pool: would this cost be
acceptable if it were incurred only under a build config option, say
CONFIG_ALLOC_ZSWAP_BATCHING_RESOURCES (default OFF)? If this is set, we
allocate ZSWAP_MAX_BATCH_SIZE acomp_ctx resources once, during
zswap_setup(). If not, we allocate only one req/buffer in the global
percpu acomp_ctx.

Since other compressors may want to do batching in the future, I thought
a more general config option name would be appropriate.

> > If not, I don't believe this simplification would be worth it, because
> > allocating for one req/buffer, then dynamically adding more resources
> > if a newly selected compressor requires more resources, would run
> > into the same race conditions and added checks as in
> > acomp_ctx_get_cpu_lock(), which, I believe, seem to be necessary because
> > CPU onlining/zswap_pool creation and CPU offlining/zswap_pool
> > deletion have the same semantics for these resources.
>
> Agree that using a single acomp_ctx per-CPU but making the resources
> resizable is not a win.

Yes, this makes sense: resizing is not the way to go.

> > The only other fallback solution in lieu of the proposed simplification that
> > I can think of is to keep the lifespan of these resources from pool creation
> > to deletion, using the CPU hotplug code. Although, it is not totally clear
> > to me if there is potential for deadlock during pool transitions, as noted
> > above.
>
> I am not sure what's the deadlock scenario you're worried about, please
> clarify.

My apologies: I was referring to triggering direct reclaim during pool
creation, which could, in theory, run into a scenario where pool switching
has to wait for reclaim to free up enough memory for the acomp_ctx
allocations. This could cause the system to hang, but it is not a deadlock.
This can happen even today, hence my interest in seeing whether a better
design is possible.

Thanks,
Kanchana