Hello Michal,

On Fri, May 13, 2022 at 05:14:26PM +0200, Michal Koutný wrote:
> On Wed, May 11, 2022 at 03:06:56PM -0400, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > Correct. After which the uncompressed page is reclaimed and
> > uncharged. So the zswapout process will reduce the charge bottom
> > line.
>
> A zswap object falling under memory.current was my first thinking. I
> was confused why it's exported as a separate counter
> memory.zswap.current (which IMO suggests disjoint counting) and why it
> doubles the memory.stat:zswap entry.
>
> Is the separate memory.zswap.current good for anything? (Except maybe
> avoiding a global rstat flush on memory.stat read, but that'd be an
> undesired precedent.)

Right, it's accounted as a subset rather than fully disjoint. But it
is a limitable counter of its own, so I exported it as such, with a
current and a max knob. This is comparable to the kmem counter in v1.

From an API POV it would be quite strange to have a max for a counter
that has no current. Likewise it would be strange for a major memory
consumer to be missing from memory.stat.

> (Ad the eventually reduced footprint, the transitional excursion
> above the memcg's (or ancestor's) limit should be limited by the
> number of parallel reclaims running (each one at most a page,
> right?), so it doesn't seem necessary to tackle (now).)

Correct.

> > memory.zswap.* are there to configure zswap policy, within the
> > boundaries of available memory - it's by definition a subset.
>
> I see how the .max works when equal to 0 or "max". The intermediate
> values are more difficult to reason about.

It needs to be configured to the workload's access frequency curve,
which can be done by trial and error (finding a reasonable balance
between zswpins and pswpins) or in a more targeted manner using tools
such as page_idle, damon etc.

> Also, I can see that on the global level, zswap is configured
> relatively (/sys/module/zswap/parameters/max_pool_percent).
> You wrote that the actual configured value is workload specific;
> would it be simpler to have a relative zswap limit per memcg too?
>
> (Relative wrt memory.max, it'd be rather just a convenience with this
> simple ratio, however, it'd correspond to the top level limit. OTOH,
> the relatives would have counter-intuitive hierarchical behavior. I
> don't mean this should be changed, rather wondering why this variant
> was chosen.)

A percentage isn't a bad way to pick a global default limit for a
kernel feature. But it would have been preferable if zswap had used
the percentage internally and made the knob itself byte-based (like
min_free_kbytes, for example). For load tuning, bytes make much more
sense: that's how you measure the workingset, so a percentage is an
awkward indirection.

At the cgroup level, it makes even less sense: all memcg tunables are
in bytes, and it would be quite weird to introduce a "max" that is
0-100. Add to that the confusion of how percentages would propagate
down the hierarchy...

> > +bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > +{
> > +	struct mem_cgroup *memcg, *original_memcg;
> > +	bool ret = true;
> > +
> > +	original_memcg = get_mem_cgroup_from_objcg(objcg);
> > +	for (memcg = original_memcg; memcg != root_mem_cgroup;
> > +	     memcg = parent_mem_cgroup(memcg)) {
> > +		unsigned long max = READ_ONCE(memcg->zswap_max);
> > +		unsigned long pages;
> > +
> > +		if (max == PAGE_COUNTER_MAX)
> > +			continue;
> > +		if (max == 0) {
> > +			ret = false;
> > +			break;
> > +		}
> > +
> > +		cgroup_rstat_flush(memcg->css.cgroup);
>
> Here, I think it'd be better not to bypass mem_cgroup_flush_stats()
> (the mechanism is approximate and you traverse all ancestors anyway),
> i.e. mem_cgroup_flush_stats() before the loop instead of this.

I don't traverse all ancestors, I bail on disabled groups and skip
unlimited ones.
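To make the bail-out behaviour concrete, here is a hypothetical
userspace model of that loop (plain Python, not kernel code; Memcg,
may_zswap and the flushed flag are invented for the sketch):

```python
# Hypothetical model of the ancestor walk in obj_cgroup_may_zswap():
# skip unlimited groups without flushing, deny at disabled ones, and
# only pay for a stats flush at groups with a real limit.

PAGE_COUNTER_MAX = float("inf")  # stand-in for the kernel constant

class Memcg:
    def __init__(self, name, zswap_max, zswap_pages=0, parent=None):
        self.name = name
        self.zswap_max = zswap_max      # limit, in pages
        self.zswap_pages = zswap_pages  # current zswap usage, in pages
        self.parent = parent
        self.flushed = False            # did we "cgroup_rstat_flush" here?

def may_zswap(memcg):
    """Walk toward the root; return False if any ancestor disallows
    storing another zswapped page. (The kernel loop stops before the
    root memcg; here an unlimited root is simply skipped.)"""
    while memcg is not None:
        limit = memcg.zswap_max
        if limit == PAGE_COUNTER_MAX:
            memcg = memcg.parent        # unlimited: skip, no flush
            continue
        if limit == 0:
            return False                # zswap disabled for this subtree
        memcg.flushed = True            # stand-in for cgroup_rstat_flush()
        if memcg.zswap_pages >= limit:
            return False                # already at the limit
        memcg = memcg.parent
    return True
```

With an unlimited root, a limited "jobs" group and a disabled "lowpri"
child, a store attempt from lowpri is denied before any flush happens,
and a store from jobs flushes only jobs itself.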
This saves a lot of flushes in practice right now: our heaviest
swapping cgroups have zswap disabled (max=0) because they're lowpri
and forced to disk. Likewise, the zswap users have their zswap limit
several levels down from the root, and I currently don't ever flush
the higher levels (max=PAGE_COUNTER_MAX).

Flushing unnecessary groups with a ratelimit doesn't sound like an
improvement to me.

Thanks
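As a footnote on the zswpin/pswpin balance mentioned earlier, a
trivial, hypothetical helper for eyeballing it from two counter
snapshots (counter names follow /proc/vmstat and memory.stat;
whether pswpin is available per-cgroup depends on the kernel, so
sourcing the numbers is left to the reader):

```python
# Hypothetical tuning aid, not part of the kernel or any tool:
# compare zswpin (served from the compressed pool) against pswpin
# (real disk swapins) across an interval.

def parse_counters(text):
    """Parse 'key value' lines (the memory.stat / vmstat format)."""
    return {k: int(v) for k, v in
            (line.split() for line in text.splitlines() if line.strip())}

def swapin_deltas(before, after):
    """Return (zswpin delta, pswpin delta) between two snapshots."""
    dz = after["zswpin"] - before["zswpin"]
    dp = after["pswpin"] - before["pswpin"]
    return dz, dp
```

A large zswpin delta with a small pswpin delta suggests the limit is
holding the hot part of the swapped-out workingset in memory; the
reverse suggests the limit is set too low.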