Hello Michal,

On Mon, May 16, 2022 at 04:34:59PM +0200, Michal Koutný wrote:
> On Fri, May 13, 2022 at 01:08:13PM -0400, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > Right, it's accounted as a subset rather than fully disjointed. But it
> > is a limitable counter of its own, so I exported it as such, with a
> > current and a max knob. This is comparable to the kmem counter in v1.
>
> That counter and limit didn't turn out well. I liked the analogy to
> writeback (and dirty limit) better.

Right, I was only talking about the design decision to add a usage
knob alongside the limit. The counter failed for a different reason.

> > From an API POV it would be quite strange to have max for a counter
> > that has no current. Likewise it would be strange for a major memory
> > consumer to be missing from memory.stat.
>
> My understanding would be to have all memory.stat entries as you
> propose, no extra .current counter and the .max knob for zswap
> configuration.

There is a longer-term advantage of sticking to the usage+limit pair
precedent. Even though the usage knob happens to correspond to a
memory.stat item in this case, we had also discussed the possibility
of breaking out a "Compressed" item in memory.stat instead, of which
zswap would only be a subset. It didn't pan out that way this time -
for unrelated reasons. But it's conceivable this will happen in
another scenario down the line, and then you'd need a separate usage
knob anyway. It's a good idea to stay consistent.

There is also an immediate practical advantage. zswap is limitable, so
an auto-tuning service might want to monitor its usage at a higher
frequency, with higher precision, and with more urgency than memory
stats are usually logged. A dedicated usage knob allows doing that.
memory.stat does not: it is a bigger file that needs to be searched
with a string parser for every sample; it's flushed lazily, so it can
be less precise than desired; yet, when it does flush, it flushes the
entire tree rather than just the target group, making it more
expensive and erratic than desired as well.

> > It needs to be configured to the workload's access frequency curve,
> > which can be done with trial-and-error (reasonable balance between
> > zswpins and pswpins) or in a more targeted manner using tools such as
> > page_idle, damon etc.
> [...]
> > Because for load tuning, bytes make much more sense. That's how you
> > measure the workingset, so a percentage is an awkward indirection. At
> > the cgroup level, it makes even less sense: all memcg tunables are in
> > bytes, it would be quite weird to introduce a "max" that is 0-100. Add
> > the confusion of how percentages would propagate down the hierarchy...
>
> Thanks for the explanation. I guess there's no simple transformation of
> in-kernel available information that'd allow a more semantic
> configuration of this value. The rather crude absolute value requires
> (but also simply allows) some calibration or responsive tuning.

Right. If you think about it, the same can be said about the memory
limit itself. It's a crude, absolute number the kernel asks of you.
Yet the optimal setting depends on the workload's ever-changing access
frequency histogram, the speed of the storage backend for paging and
swapping, and the workload's tolerances for paging latencies.
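As a rough illustration of that kind of userspace calibration - and of
why a dedicated, cheap-to-read usage knob helps with it - here is a
sketch of what such a tuning loop could look like. The cgroup path,
sampling interval, step size and the policy itself are made-up
placeholders; only the memory.zswap.current/max knobs and the
zswpin/pswpin counters come from the discussion above:

  # Illustrative only: sample one cgroup's zswap footprint at high
  # frequency and crudely recalibrate memory.zswap.max from the
  # zswpin/pswpin balance. Paths, interval, step and policy are
  # placeholders, not part of this series.
  import time

  CGROUP = "/sys/fs/cgroup/workload.slice"   # hypothetical cgroup
  STEP = 64 << 20                            # adjust limit in 64M steps

  def read_counter(path):
      # Dedicated knob: one small file, one integer, no string parsing.
      with open(path) as f:
          return int(f.read())

  def read_stat(path, key):
      # The memory.stat / vmstat alternative: scan a whole file per sample.
      with open(path) as f:
          for line in f:
              k, v = line.split()
              if k == key:
                  return int(v)
      return 0

  prev_zswpin = read_stat(f"{CGROUP}/memory.stat", "zswpin")
  prev_pswpin = read_stat("/proc/vmstat", "pswpin")   # system-wide counter

  while True:
      time.sleep(1)

      usage = read_counter(f"{CGROUP}/memory.zswap.current")
      zswpin = read_stat(f"{CGROUP}/memory.stat", "zswpin")
      pswpin = read_stat("/proc/vmstat", "pswpin")

      # Placeholder policy: if real swapins outpace compressed swapins,
      # assume the zswap pool is too small and give it more room.
      if pswpin - prev_pswpin > zswpin - prev_zswpin:
          with open(f"{CGROUP}/memory.zswap.max", "w") as f:
              f.write(str(usage + STEP))

      prev_zswpin, prev_pswpin = zswpin, pswpin

A real controller would obviously want a smarter policy (hysteresis,
shrinking as well as growing, per-workload thresholds), but the point
is that the fast path is a single integer read per sample rather than
a full memory.stat parse and the flush that comes with it.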
Hopefully one day we'll be able to set a pressure/latency threshold on
the cgroup, and have the kernel optimally distribute the workingset
across RAM, zswap, second-tier memory, and storage - preferring the
cheapest and slowest possible backing for every page while meeting the
SLA. The proactive reclaim and memory offloading work is pursuing
this. For now, we need the interface that allows tuning and exploring
from userspace. When there is something that's ready for a general
purpose OS kernel, those absolute knobs won't get in the way - just
like memory.max doesn't get in the way of proactive reclaim today.

Anyway, I'm just outlining where I'm coming from with this. It looks
like we agree on the end result.

> > Flushing unnecessary groups with a ratelimit doesn't sound like an
> > improvement to me.
>
> Then I'm only concerned about a situation when there's a single deep
> memcg that undergoes both workingset_refault() and zswap querying.
> The latter (bare call to cgroup_rstat_flush()) won't reset
> stats_flush_threshold, so the former (or the async flush more likely)
> would attempt a flush too. The flush work (on the leaf memcg) would be
> done twice even though it may be within the tolerance of cumulated
> error the second time.
>
> This is a thing that might require attention in the future (depending on
> some data how it actually performs). I see how the current approach is
> justified.

Yes, we can optimize it should the need arise. So far it's been fine.

Thanks for your thoughts, Michal.