> -----Original Message----- > From: Nhat Pham <nphamcs@xxxxxxxxx> > Sent: Tuesday, September 24, 2024 1:51 PM > To: Yosry Ahmed <yosryahmed@xxxxxxxxxx> > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>; linux- > kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; hannes@xxxxxxxxxxx; > chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx; > shakeel.butt@xxxxxxxxx; ryan.roberts@xxxxxxx; Huang, Ying > <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; > Zou, Nanhai <nanhai.zou@xxxxxxxxx>; Feghali, Wajdi K > <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx> > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in > zswap_store(). > > On Tue, Sep 24, 2024 at 12:39 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> > wrote: > > > > On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar > > > + * The cgroup zswap limit check is done once at the beginning of an > > > + * mTHP store, and not within zswap_store_page() for each page > > > + * in the mTHP. We do however check the zswap pool limits at the > > > + * start of zswap_store_page(). What this means is, the cgroup > > > + * could go over the limits by at most (HPAGE_PMD_NR - 1) pages. > > > + * However, the per-store-page zswap pool limits check should > > > + * hopefully trigger the cgroup aware and zswap LRU aware global > > > + * reclaim implemented in the shrinker. If this assumption holds, > > > + * the cgroup exceeding the zswap limits could potentially be > > > + * resolved before the next zswap_store, and if it is not, the next > > > + * zswap_store would fail the cgroup zswap limit check at the start. > > > + */ > > > > I do not really like this. Allowing going one page above the limit is > > one thing, but one THP above the limit seems too much. I also don't > > Hmm what if you have multiple concurrent zswap stores, from different > tasks but the same cgroup? If none of them has charged, they would all > get greenlit, and charge towards the cgroup... > > So technically the zswap limit checking is already best-effort only. > But now, instead of one page per violation, it's 512 pages per > violation :) > > Yeah this can be bad. I think this is only safe if you only use > zswap.max as a binary knob (0 or max)... > > > like relying on the repeated limit checking in zswap_store_page(), if > > anything I think that should be batched too. > > > > Is it too unreasonable to maintain the average compression ratio and > > use that to estimate limit checking for both memcg and global limits? > > Johannes, Nhat, any thoughts on this? > > I remember asking about this, but past Nhat might have relented :) > > https://lore.kernel.org/linux- > mm/CAKEwX=PfAMZ2qJtwKwJsVx3TZWxV5z2ZaU1Epk1UD=DBdMsjFA@mail > .gmail.com/ > > We can do limit checking and charging after compression is done, but > that's a lot of code change (might not even be possible)... It will, > however, allow us to do charging + checking in one go (rather than > doing it 8, 16, or 512 times) > > Another thing we can do is to register a zswap writeback after the > zswap store attempts to clean up excess capacity. Not sure what will > happen if zswap writeback is disabled for the cgroup though :) > > If it's too hard, the average estimate could be a decent compromise, > until we figure something smarter. Thanks Yosry and Nhat for these insights. This is how I was viewing this scenario: I thought of incrementally (per subpage store) calling zswap_pool_get() and limit checks followed by shrinker invocations in case of error conditions to allow different concurrent stores to make progress, without favoring only one process's mTHP store based on there being enough zpool space available (for e.g. based on compression ratio estimate). Besides simplicity and no added overhead in the regular cases, I was thinking this approach would have minimal impact on the process(es) that see the zswap limit being exceeded, and that this would be better than preemptively checking for the entire mTHP and failing (this could also complicate things where no one makes progress because multiple processes run the batch checks and fail, when realistically one/many could have triggered the shrinker before erroring out, and at least one/few could have made progress). Another potential solution for this could be based on experimental data for a given setup, on mTHP swapout failures and say, reducing the zswap zpool max-limit and/or acceptance threshold perhaps? Would appreciate your suggestions on how to proceed as far as the limit checks. Thanks, Kanchana