On Fri, May 13, 2022 at 01:08:13PM -0400, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> Right, it's accounted as a subset rather than fully disjointed. But it
> is a limitable counter of its own, so I exported it as such, with a
> current and a max knob. This is comparable to the kmem counter in v1.

That counter and limit didn't turn out well. I liked the analogy to
writeback (and dirty limit) better.

> From an API POV it would be quite strange to have max for a counter
> that has no current. Likewise it would be strange for a major memory
> consumer to be missing from memory.stat.

My understanding would be to have all memory.stat entries as you
propose, no extra .current counter and the .max knob for zswap
configuration.

> It needs to be configured to the workload's access frequency curve,
> which can be done with trial-and-error (reasonable balance between
> zswpins and pswpins) or in a more targeted manner using tools such as
> page_idle, damon etc.
> [...]
> Because for load tuning, bytes make much more sense. That's how you
> measure the workingset, so a percentage is an awkward indirection. At
> the cgroup level, it makes even less sense: all memcg tunables are in
> bytes, it would be quite weird to introduce a "max" that is 0-100. Add
> the confusion of how percentages would propagate down the hierarchy...

Thanks for the explanation. I guess there's no simple transformation of
in-kernel available information that'd allow a more semantic
configuration of this value. The rather crude absolute value requires
(but also simply allows) some calibration or responsive tuning.

> I don't traverse all ancestors, I bail on disabled groups and skip
> unlimited ones.

I admit I missed that.

> Flushing unnecessary groups with a ratelimit doesn't sound like an
> improvement to me.

Then I'm only concerned about a situation when there's a single deep
memcg that undergoes both workingset_refault() and zswap querying.
The latter (bare call to cgroup_rstat_flush()) won't reset
stats_flush_threshold, so the former (or the async flush more likely)
would attempt a flush too. The flush work (on the leaf memcg) would be
done twice even though it may be within the tolerance of accumulated
error the second time.

This is a thing that might require attention in the future (depending on
data showing how it actually performs). I see how the current approach
is justified.

Regards,
Michal
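
For illustration only, the double-flush concern above can be sketched as
a toy model in Python. This is not kernel code. The class ToyStats, its
fields, and the flush_limit value are hypothetical names chosen for the
sketch; `pending` merely stands in for the role stats_flush_threshold
plays in the description above, under the assumption that the gated path
flushes only when that counter exceeds a limit and then resets it.

```python
class ToyStats:
    """Toy model of threshold-gated stat flushing (hypothetical, not kernel code)."""

    def __init__(self, flush_limit):
        self.flush_limit = flush_limit  # assumed tolerance for pending updates
        self.pending = 0                # stands in for stats_flush_threshold
        self.unflushed = 0              # updates not yet folded into the totals
        self.flush_count = 0            # number of times flush work actually ran

    def update(self, n=1):
        # A stat update raises both the gate counter and the real backlog.
        self.pending += n
        self.unflushed += n

    def _do_flush(self):
        self.flush_count += 1
        self.unflushed = 0

    def bare_flush(self):
        # Models a direct flush call: does the work but leaves `pending` alone.
        self._do_flush()

    def gated_flush(self):
        # Models the thresholded path: flush only above the limit, then reset.
        if self.pending > self.flush_limit:
            self._do_flush()
            self.pending = 0
            return True
        return False


def simulate():
    stats = ToyStats(flush_limit=4)
    stats.update(5)                  # enough updates to exceed the limit
    stats.bare_flush()               # zswap-style query flushes directly
    ran_again = stats.gated_flush()  # refault-style path still sees a stale gate
    return stats.flush_count, ran_again
```

In this model the second flush runs although nothing is left to fold in,
because the bare flush cleared the backlog without touching the gate
counter. Whether that redundancy matters in practice depends on
measurements, as noted above.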