On Wed, Sep 25, 2024 at 11:30:34AM -0700, Yosry Ahmed wrote:
> Johannes wrote:
> > If this ever becomes an issue, we can handle it in a fastpath-slowpath
> > scheme: check the limit up front for fast-path failure if we're
> > already maxed out, just like now; then make obj_cgroup_charge_zswap()
> > atomically charge against zswap.max and unwind the store if we raced.
> >
> > For now, I would just keep the simple version we currently have: check
> > once in zswap_store() and then just go ahead for the whole folio.
>
> I am not totally against this, but I feel like this is too optimistic.
> I think we can keep it simple-ish by maintaining an EWMA for the
> compression ratio; we already have primitives for this (see
> DECLARE_EWMA).
>
> Then in zswap_store(), we can use the EWMA to estimate the compressed
> size and use it to do the memcg and global limit checks once, like we
> do today. Instead of just checking whether we are below the limits, we
> check whether we have enough headroom for the estimated compressed
> size. Then we call zswap_store_page() to do the per-page stuff, then
> do batched charging and stats updates.

I'm not sure what you gain from making a non-atomic check precise.

You can get a hundred threads each determining precisely that *their*
store will fit exactly into the last 800kB before the limit.

> If you think that's overkill, we can keep doing the limit checks as
> we do today; I just don't see how it would make a practical difference.

What would make a difference is atomic transactional charging of the
compressed size, and unwinding on failure - with the upfront check to
avoid pointlessly compressing (outside of race conditions).

And I'm not against doing that in general, I am just against doing it
by default. It's a lot of complexity, and like I said, the practical
use case for limiting zswap memory to begin with is quite unclear to me.

Zswap is not a limited resource. It's just memory. And you already had
the memory for the uncompressed copy. So it's a bit strange to me to
say "you have compressed your memory enough, so now you get sent to
disk (or we declare OOM)".

What would be a reason to limit it? It sort of makes sense as a binary
switch, but I don't get the use case for a granular limit. (And I blame
my own cowardice for making the cgroup knob a limit, to keep options
open, instead of a switch.)

All that to say, this would be better in a follow-up patch. We allow
overshooting now; it's not clear how overshooting by a larger amount
makes a categorical difference.

> but I would still like to see batching of all the limit checks,
> charging, and stats updates. It makes little sense otherwise.

Definitely. One check, one charge, one stat update per folio.
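
For concreteness, here is a stand-alone sketch of that flow in plain
C11 - a toy model with made-up names and a fixed byte limit, not
actual zswap or objcg code: one cheap upfront check, one transactional
try-charge of the folio's final compressed size, one stat update, and
an unwind path when the charge loses the race.

	/*
	 * Stand-alone model of "check once, compress the whole folio, then one
	 * atomic try-charge against the limit, unwinding on a lost race".
	 * All names are made up; this is not zswap code.
	 */
	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stddef.h>
	#include <stdio.h>

	static atomic_size_t pool_charged;		/* bytes currently charged */
	static const size_t pool_limit = 100 << 20;	/* stand-in for zswap.max */

	/* Cheap upfront check: don't bother compressing if clearly maxed out. */
	static bool pool_over_limit(void)
	{
		return atomic_load(&pool_charged) >= pool_limit;
	}

	/*
	 * Transactional charge of the final compressed size: cmpxchg loop so
	 * that concurrent stores can't all squeeze into the same headroom.
	 */
	static bool pool_try_charge(size_t bytes)
	{
		size_t old = atomic_load(&pool_charged);

		do {
			if (old + bytes > pool_limit)
				return false;	/* lost the race, caller unwinds */
		} while (!atomic_compare_exchange_weak(&pool_charged, &old,
						       old + bytes));
		return true;
	}

	static void pool_uncharge(size_t bytes)
	{
		atomic_fetch_sub(&pool_charged, bytes);
	}

	/* Folio-store skeleton: one check, one charge, one stat update. */
	static bool store_folio(const size_t *compressed_size, int nr_pages)
	{
		size_t total = 0;
		int i;

		if (pool_over_limit())
			return false;			/* fast-path failure */

		for (i = 0; i < nr_pages; i++)		/* "compress" each page */
			total += compressed_size[i];

		if (!pool_try_charge(total)) {
			/* unwind: real code would free the compressed copies here */
			return false;
		}

		/* single stat update for the whole folio would go here */
		return true;
	}

	int main(void)
	{
		size_t sizes[4] = { 1200, 900, 2048, 700 };

		printf("stored: %d\n", store_folio(sizes, 4));
		printf("charged: %zu bytes\n", atomic_load(&pool_charged));
		return 0;
	}

The cmpxchg loop is what makes the charge transactional: a hundred
racing threads can all pass the cheap upfront check, but only the ones
whose compressed totals actually fit get charged; the rest unwind
instead of overshooting the limit.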