> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> Sent: Wednesday, September 25, 2024 11:31 AM
> To: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>;
> linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; nphamcs@xxxxxxxxx;
> chengming.zhou@xxxxxxxxx; usamaarif642@xxxxxxxxx;
> shakeel.butt@xxxxxxxxx; ryan.roberts@xxxxxxx; Huang, Ying
> <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
> Zou, Nanhai <nanhai.zou@xxxxxxxxx>; Feghali, Wajdi K
> <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
>
> [..]
> > > > +	/*
> > > > +	 * Check cgroup limits:
> > > > +	 *
> > > > +	 * The cgroup zswap limit check is done once at the beginning of an
> > > > +	 * mTHP store, and not within zswap_store_page() for each page
> > > > +	 * in the mTHP. We do however check the zswap pool limits at the
> > > > +	 * start of zswap_store_page(). What this means is, the cgroup
> > > > +	 * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > > > +	 * However, the per-store-page zswap pool limits check should
> > > > +	 * hopefully trigger the cgroup aware and zswap LRU aware global
> > > > +	 * reclaim implemented in the shrinker. If this assumption holds,
> > > > +	 * the cgroup exceeding the zswap limits could potentially be
> > > > +	 * resolved before the next zswap_store, and if it is not, the next
> > > > +	 * zswap_store would fail the cgroup zswap limit check at the start.
> > > > +	 */
> > >
> > > I do not really like this. Allowing going one page above the limit is
> > > one thing, but one THP above the limit seems too much. I also don't
> > > like relying on the repeated limit checking in zswap_store_page(); if
> > > anything, I think that should be batched too.
> > >
> > > Is it too unreasonable to maintain the average compression ratio and
> > > use that to estimate limit checking for both memcg and global limits?
> > > Johannes, Nhat, any thoughts on this?
> >
> > I honestly don't think it's much of an issue. The global limit is
> > huge, and the cgroup limit is, to the best of my knowledge, only used
> > as a binary switch. Setting a non-binary limit - global or cgroup -
> > seems like a bit of an obscure use case to me, because in the vast
> > majority of cases it's preferable to keep compressing over declaring
> > OOM.
> >
> > And even if you do have some granular limit, the workload size scales
> > with it. It's not like you have a thousand THPs in a 10M cgroup.
>
> The memcg limit and zswap limit can be disproportionate, although that
> shouldn't be common.
>
> > If this ever becomes an issue, we can handle it in a fastpath-slowpath
> > scheme: check the limit up front for fast-path failure if we're
> > already maxed out, just like now; then make obj_cgroup_charge_zswap()
> > atomically charge against zswap.max and unwind the store if we raced.
> >
> > For now, I would just keep the simple version we currently have: check
> > once in zswap_store() and then just go ahead for the whole folio.
>
> I am not totally against this, but I feel like it is too optimistic.
> I think we can keep it simple-ish by maintaining an EWMA for the
> compression ratio; we already have primitives for this (see
> DECLARE_EWMA).
>
> Then in zswap_store(), we can use the EWMA to estimate the compressed
> size and use it to do the memcg and global limit checks once, like we
> do today.
> Instead of just checking if we are below the limits, we check if we
> have enough headroom for the estimated compressed size. Then we call
> zswap_store_page() to do the per-page stuff, then do batched charging
> and stats updates.
>
> If you think that's overkill, we can keep doing the limit checks as we
> do today, but I would still like to see batching of all the limit
> checks, charging, and stats updates. It makes little sense otherwise.

Thanks, Johannes and Yosry, for these suggestions and pointers.

I think there is general agreement on batching the charging and the
zswap_stored_pages/stats updates.

Yosry, does "batching of all the limit checks" mean a single check
against the cgroup limit at the start of zswap_store(), with no
per-page check in zswap_store_page()? Does this also imply a
zswap_pool_get_many() that takes the per-page pool references in one
call? I would appreciate it if you could help clarify.

My main question about the EWMA-based checks is whether they would add
overhead to the normal zswap reclaim path; and if so, whether the
simple limit check at the start of zswap_store() that Johannes
suggested would suffice. I can run a few experiments to quantify this
overhead, and maybe we can revisit this then?

To confirm my understanding of the proposal, I have appended a rough
sketch of the EWMA-based estimate and the batched accounting below my
signature.

Thanks,
Kanchana
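
The sketch below is rough and untested, and assumes it lives in
mm/zswap.c. zswap_note_comp_len(), zswap_can_store(),
zswap_check_headroom() and zswap_batch_account() are made-up
placeholder names, not existing functions; DECLARE_EWMA (from
<linux/average.h>), obj_cgroup_charge_zswap() and the ZSWPOUT vm event
are the existing primitives I am assuming we would build on:

#include <linux/average.h>

/* 10 bits of fractional precision, reciprocal weight of 8. */
DECLARE_EWMA(zswap_comp_len, 10, 8);

static struct ewma_zswap_comp_len zswap_avg_comp_len;

/* Feed the EWMA after each successful per-page compression. */
static inline void zswap_note_comp_len(unsigned int comp_len)
{
	ewma_zswap_comp_len_add(&zswap_avg_comp_len, comp_len);
}

/*
 * Up-front check in zswap_store(): estimate the folio's compressed
 * size from the running average, and require that much headroom under
 * the global and cgroup limits instead of just "are we below them".
 * Note: the EWMA reads 0 until the first sample, so a sane initial
 * estimate (e.g. PAGE_SIZE) would be needed for the very first stores.
 */
static bool zswap_can_store(struct folio *folio, struct obj_cgroup *objcg)
{
	unsigned long est = folio_nr_pages(folio) *
			    ewma_zswap_comp_len_read(&zswap_avg_comp_len);

	/* Placeholder for today's limit checks, extended with headroom. */
	return zswap_check_headroom(objcg, est);
}

/*
 * Batched accounting, done once per folio after all zswap_store_page()
 * calls have succeeded, instead of charging and counting per page.
 */
static void zswap_batch_account(struct obj_cgroup *objcg, int nr_pages,
				size_t total_comp_bytes)
{
	if (objcg)
		obj_cgroup_charge_zswap(objcg, total_comp_bytes);
	atomic_add(nr_pages, &zswap_stored_pages);
	count_vm_events(ZSWPOUT, nr_pages);
}

If this matches what you have in mind, the error path would also need a
matching batched unwind/uncharge, but hopefully the above captures the
idea.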