[..]
> > > + /*
> > > + * Check cgroup limits:
> > > + *
> > > + * The cgroup zswap limit check is done once at the beginning of an
> > > + * mTHP store, and not within zswap_store_page() for each page
> > > + * in the mTHP. We do however check the zswap pool limits at the
> > > + * start of zswap_store_page(). What this means is, the cgroup
> > > + * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > > + * However, the per-store-page zswap pool limits check should
> > > + * hopefully trigger the cgroup aware and zswap LRU aware global
> > > + * reclaim implemented in the shrinker. If this assumption holds,
> > > + * the cgroup exceeding the zswap limits could potentially be
> > > + * resolved before the next zswap_store, and if it is not, the next
> > > + * zswap_store would fail the cgroup zswap limit check at the start.
> > > + */
> >
> > I do not really like this. Allowing going one page above the limit is
> > one thing, but one THP above the limit seems too much. I also don't
> > like relying on the repeated limit checking in zswap_store_page(); if
> > anything, I think that should be batched too.
> >
> > Is it too unreasonable to maintain the average compression ratio and
> > use that to estimate limit checking for both memcg and global limits?
> > Johannes, Nhat, any thoughts on this?
>
> I honestly don't think it's much of an issue. The global limit is
> huge, and the cgroup limit is to the best of my knowledge only used as
> a binary switch. Setting a non-binary limit - global or cgroup - seems
> like a bit of an obscure usecase to me, because in the vast majority
> of cases it's preferable to keep compressing over declaring OOM.
>
> And even if you do have some granular limit, the workload size scales
> with it. It's not like you have a thousand THPs in a 10M cgroup.

The memcg limit and zswap limit can be disproportionate, although that
shouldn't be common.

>
> If this ever becomes an issue, we can handle it in a fastpath-slowpath
> scheme: check the limit up front for fast-path failure if we're
> already maxed out, just like now; then make obj_cgroup_charge_zswap()
> atomically charge against zswap.max and unwind the store if we raced.
>
> For now, I would just keep the simple version we currently have: check
> once in zswap_store() and then just go ahead for the whole folio.

I am not totally against this, but I feel like it is too optimistic.

I think we can keep it simple-ish by maintaining an EWMA for the
compression ratio; we already have primitives for this (see
DECLARE_EWMA). Then in zswap_store(), we can use the EWMA to estimate
the compressed size and use it to do the memcg and global limit checks
once, like we do today. Instead of just checking whether we are below
the limits, we check whether we have enough headroom for the estimated
compressed size. Then we call zswap_store_page() to do the per-page
work, and finally do batched charging and stats updates. A rough sketch
of what I mean is below.

If you think that's overkill, we can keep doing the limit checks as we
do today, but I would still like to see batching of all the limit
checks, charging, and stats updates. It makes little sense otherwise.
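
To be concrete, here is a hedged sketch (not the posted patch) of the
EWMA idea, assuming we track average compressed bytes per page with
DECLARE_EWMA from <linux/average.h>. The precision/weight values and
the helper names (zswap_ratio_update(), zswap_estimate_compressed())
are made up for illustration:

/*
 * Rough sketch only: track average compressed bytes per page with an
 * EWMA and use it to size a single up-front limit check per folio.
 * The precision/weight values are arbitrary choices.
 */
#include <linux/average.h>
#include <linux/mm.h>

DECLARE_EWMA(zswap_ratio, 10, 8)	/* 10 fractional bits, new samples weighted 1/8 */

static struct ewma_zswap_ratio zswap_comp_bytes_per_page;

static void zswap_ratio_init(void)
{
	ewma_zswap_ratio_init(&zswap_comp_bytes_per_page);
}

/* feed the observed compressed length after each successful store */
static void zswap_ratio_update(unsigned int comp_len)
{
	ewma_zswap_ratio_add(&zswap_comp_bytes_per_page, comp_len);
}

/* estimate the compressed footprint of a whole folio */
static unsigned long zswap_estimate_compressed(struct folio *folio)
{
	unsigned long per_page = ewma_zswap_ratio_read(&zswap_comp_bytes_per_page);

	/* no samples yet: assume incompressible to stay conservative */
	if (!per_page)
		per_page = PAGE_SIZE;

	return per_page * folio_nr_pages(folio);
}

zswap_store() would then check once that both the global pool and the
objcg have at least zswap_estimate_compressed(folio) bytes of headroom
before touching any page, and do the actual charging and stats updates
in one batched call after all pages are stored.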
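
For reference, the fastpath/slowpath scheme quoted above could look
roughly like the following. This is only a sketch using a plain atomic
counter; zswap_total_bytes and zswap_max_bytes are hypothetical
placeholders, not the real memcg/zswap accounting:

/*
 * Minimal sketch of "charge atomically and unwind if we raced": fail
 * fast if already over the limit, otherwise charge optimistically and
 * back out if the addition pushed us past the limit.
 */
#include <linux/atomic.h>
#include <linux/limits.h>

static atomic_long_t zswap_total_bytes;			/* hypothetical usage counter */
static unsigned long zswap_max_bytes = ULONG_MAX;	/* hypothetical limit */

static bool zswap_try_charge(unsigned long nr_bytes)
{
	/* fast path: don't even try if we are already maxed out */
	if (atomic_long_read(&zswap_total_bytes) >= zswap_max_bytes)
		return false;

	/* slow path: charge, then check whether we raced past the limit */
	if (atomic_long_add_return(nr_bytes, &zswap_total_bytes) > zswap_max_bytes) {
		atomic_long_sub(nr_bytes, &zswap_total_bytes);
		return false;	/* caller unwinds the partially-done store */
	}
	return true;
}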