> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Sent: Thursday, September 26, 2024 10:35 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> linux-mm@xxxxxxxxx; nphamcs@xxxxxxxxx; chengming.zhou@xxxxxxxxx;
> usamaarif642@xxxxxxxxx; shakeel.butt@xxxxxxxxx; ryan.roberts@xxxxxxx;
> Huang, Ying <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@xxxxxxxxx>;
> Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
>
> On Thu, Sep 26, 2024 at 10:29 AM Sridhar, Kanchana P
> <kanchana.p.sridhar@xxxxxxxxx> wrote:
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > > Sent: Thursday, September 26, 2024 10:20 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> > > Cc: Johannes Weiner <hannes@xxxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> > > linux-mm@xxxxxxxxx; nphamcs@xxxxxxxxx; chengming.zhou@xxxxxxxxx;
> > > usamaarif642@xxxxxxxxx; shakeel.butt@xxxxxxxxx; ryan.roberts@xxxxxxx;
> > > Huang, Ying <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx;
> > > akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@xxxxxxxxx>;
> > > Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> > > Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> > > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> > > zswap_store().
> > >
> > > On Thu, Sep 26, 2024 at 9:40 AM Sridhar, Kanchana P
> > > <kanchana.p.sridhar@xxxxxxxxx> wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > > > > Sent: Wednesday, September 25, 2024 9:52 PM
> > > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
> > > > > Cc: Johannes Weiner <hannes@xxxxxxxxxxx>;
> > > > > linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> > > > > nphamcs@xxxxxxxxx; chengming.zhou@xxxxxxxxx;
> > > > > usamaarif642@xxxxxxxxx; shakeel.butt@xxxxxxxxx;
> > > > > ryan.roberts@xxxxxxx; Huang, Ying <ying.huang@xxxxxxxxx>;
> > > > > 21cnbao@xxxxxxxxx; akpm@linux-foundation.org;
> > > > > Zou, Nanhai <nanhai.zou@xxxxxxxxx>;
> > > > > Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
> > > > > Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> > > > > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> > > > > zswap_store().
> > > > >
> > > > > [..]
> > > > > >
> > > > > > One thing I realized while reworking the patches for the batched
> > > > > > checks is: within zswap_store_page(), we set the entry->objcg and
> > > > > > entry->pool before adding it to the xarray. Given this, wouldn't
> > > > > > it be safer to get the objcg and pool reference per sub-page,
> > > > > > locally in zswap_store_page(), rather than obtaining batched
> > > > > > references at the end if the store is successful? If we want
> > > > > > zswap_store_page() to be self-contained and correct as far as the
> > > > > > entry being created and added to the xarray, it seems like the
> > > > > > right thing to do?
> > > > > > I am a bit apprehensive about the entry being added to the xarray
> > > > > > without a reference obtained on the objcg and pool, because any
> > > > > > page-faults/writeback that occur on sub-pages added to the xarray
> > > > > > before the entire folio has been stored, would run into issues.
> > > > >
> > > > > We definitely should not obtain references to the pool and objcg after
> > > > > initializing the entries with them. We can obtain all references in
> > > > > zswap_store() before zswap_store_page(). IOW, the batching in this
> > > > > case should be done before the per-page operations, not after.
> > > >
> > > > Thanks Yosry. IIUC, we should obtain all references to the objcg and
> > > > to the zswap_pool at the start of zswap_store.
> > > >
> > > > In the case of error on any sub-page, we will unwind state for
> > > > potentially only the stored pages or the entire folio if it happened
> > > > to already be in zswap and is being re-written. We might need some
> > > > additional book-keeping to keep track of which sub-pages were found in
> > > > the xarray and zswap_entry_free() got called (nr_sb). Assuming I
> > > > define a new "obj_cgroup_put_many()", I would need to call this with
> > > > (folio_nr_pages() - nr_sb).
> > > >
> > > > As far as zswap_pool_get(), there is some added complexity if we want
> > > > to keep the existing implementation that calls "percpu_ref_tryget()",
> > > > and assuming this is extended to provide a new "zswap_pool_get_many()"
> > > > that calls "percpu_ref_tryget_many()". Is there a reason we use
> > > > percpu_ref_tryget() instead of percpu_ref_get()? Reason I ask is, with
> > > > tryget(), if for some reason the pool->ref is 0, no further increments
> > > > will be made. If so, upon unwinding state in zswap_store(), I would
> > > > need to special-case to catch this before calling a new
> > > > "zswap_pool_put_many()".
> > > >
> > > > Things could be a little simpler if zswap_pool_get() can use
> > > > "percpu_ref_get()" which will always increment the refcount. Since the
> > > > zswap pool->ref is initialized to "1", this seems Ok, but I don't know
> > > > if there will be unintended consequences.
> > > >
> > > > Can you please advise on what is the simplest/cleanest approach:
> > > >
> > > > 1) Proceed with the above changes without changing percpu_ref_tryget
> > > >    in zswap_pool_get. Needs special-casing in zswap_store to detect
> > > >    pool->ref being "0" before calling zswap_pool_put[_many].
> > >
> > > My assumption is that we can reorder the code such that if
> > > zswap_pool_get_many() fails we don't call zswap_pool_put_many() to
> > > begin with (e.g. jump to a label after zswap_pool_put_many()).
> >
> > However, the pool refcount could change between the start and end of
> > zswap_store.
>
> I am not sure what you mean. If zswap_pool_get_many() fails then we
> just do not call zswap_pool_put_many() at all and abort.

I guess I was thinking of a scenario where zswap_pool_get_many() returns
true; subsequently, the pool refcount reaches 0 before the
zswap_pool_put_many(). I just realized this shouldn’t happen, so I think
we are Ok. Will think about this some more while creating the follow-up
patch.

> >
> > > >
> > > > 2) Modify zswap_pool_get/zswap_pool_get_many to use
> > > >    percpu_ref_get_many and avoid special-casing to detect pool->ref
> > > >    being "0" before calling zswap_pool_put[_many].
> > >
> > > I don't think we can simply switch the tryget to a get, as I believe
> > > we can race with the pool being destroyed.
> >
> > That was my initial thought as well, but I figured this couldn't happen
> > since the pool->ref is initialized to "1", and based on the existing
> > implementation. In any case, I can understand the intent of the use
> > of "tryget"; it is just that it adds to the considerations for reference
> > batching.
>
> The initial ref can be dropped in __zswap_param_set() if a new pool is
> created (see the call to percpu_ref_kill()).

I see.. this makes sense, thanks Yosry!
> >
> > > >
> > > > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to
> > > >    zswap_store_page for both success and error conditions, and any
> > > >    unwinding state in zswap_store will take care of dropping
> > > >    references obtained from prior successful writes (from this or
> > > >    prior invocations of zswap_store).
> > >
> > > I am also fine with doing that and doing the reference batching as a
> > > follow up.
> >
> > I think so too! We could try and improve upon (3) with reference batching
> > in a follow-up patch.
>
> SGTM.

Thanks, will proceed!
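
For illustration only, the ordering discussed in this thread (take the batched references before the per-page work, and jump to a label after the put if the tryget fails) can be sketched in userspace C. Everything named demo_* is hypothetical and is not kernel code: the counter is a plain long that only models "tryget fails once the count has reached 0", not a real percpu_ref, and fail_at is just a knob to simulate a sub-page store failure.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a pool refcount; NOT the kernel's percpu_ref. */
struct demo_ref {
	long count;	/* outstanding references, starts at 1 for a live pool */
};

/* Models tryget_many: take nr references unless the count already hit 0. */
static inline bool demo_ref_tryget_many(struct demo_ref *ref, long nr)
{
	if (ref->count == 0)
		return false;	/* nothing was taken, so nothing to put later */
	ref->count += nr;
	return true;
}

static inline void demo_ref_put_many(struct demo_ref *ref, long nr)
{
	ref->count -= nr;
}

/* Models dropping the initial reference, e.g. when a pool is replaced. */
static inline void demo_ref_kill(struct demo_ref *ref)
{
	ref->count -= 1;
}

/*
 * Hypothetical store path for a folio of nr_pages sub-pages.
 * fail_at is the index of the sub-page whose store fails, or -1 for none.
 */
static inline bool demo_store(struct demo_ref *pool_ref, long nr_pages,
			      long fail_at)
{
	long stored = 0;

	/* Batch the references up front, before any per-page work. */
	if (!demo_ref_tryget_many(pool_ref, nr_pages))
		goto out;	/* label sits after the put: no special-casing */

	for (stored = 0; stored < nr_pages; stored++) {
		if (stored == fail_at)
			goto put_refs;
		/* On success the new entry owns one of the batched refs. */
	}
	return true;

put_refs:
	/* Freeing each already-stored entry drops the reference it owns. */
	for (long i = 0; i < stored; i++)
		demo_ref_put_many(pool_ref, 1);
	/* Return the batched references that no entry ended up owning. */
	demo_ref_put_many(pool_ref, nr_pages - stored);
out:
	return false;
}
```

With this ordering, a failed tryget never reaches the put, so the "pool->ref is 0" case needs no extra check in the unwind path.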