On Thu, Sep 26, 2024 at 4:40 PM Ivan Shapovalov <intelfx@xxxxxxxxxxxx> wrote: > > On 2024-09-26 at 16:12 -0700, Nhat Pham wrote: > > On Thu, Sep 26, 2024 at 3:55 PM Ivan Shapovalov <intelfx@xxxxxxxxxxxx> wrote: > > > > > > Improve the inheritance behavior of the `memory.zswap.writeback` cgroup > > > attribute introduced during the 6.11 cycle. Specifically, in 6.11 we > > > walk the parent cgroups until we find a _disabled_ writeback, which does > > > not allow the user to selectively enable zswap writeback while having it > > > disabled by default. > > > > Is there an actual need for this? This is a theoretical use case I > > thought of (and raised), but I don't think anybody actually wants > > this...? > > This is of course anecdata, but yes, it does solve a real use-case that > I'm having right now, as well as a bunch of my colleagues who recently > complained to me (in private) about pretty much the same use-case. > > The use-case is following: it turns out that it could be beneficial for > desktop systems to run with a pretty high swappiness and zswap > writeback globally disabled, to nudge the system to compress cold pages > but not actually write them back to the disk (which would happen pretty > aggressively if it was not disabled) to reduce I/O and latencies. > However, under this setup it is sometimes needed to re-enable zswap > writeback for specific memory-heavy applications that allocate a lot of > cold pages, to "allow" the kernel to actually swap those programs out. Out of pure curiosity (and to make sure I fully grasp the problem at hand), is this the capacity-based zswap writeback (i.e the one triggered by limits), or the memory pressure based dynamic shrinker? If you disable the latter and only rely on the former, it should not "write pages aggressively". Limits are rarely reached (IIUC, zswap.max are frequently used as binary knobs, and global limits are hard to reach), so usually pages that are going to disk swap are just pages zswap reject (i.e mostly just pages that fail to compress). This should be a very small category. You will still see "swap" usage due to a quirk in zswap architecture (which I'm working to fix), but there should rarely be any IOs. So the setup itself is not super necessary. If it's the latter then yeah I can kinda see the need for the setup. > > > > > Besides, most people who want this can just: > > > > 1. Enable zswap writeback on root cgroup (and all non-leaf cgroups). > > > > 2. Disable zswap writeback on leaf cgroups on creation by default. > > > > 3. Selectively enable zswap writeback for the leaf cgroups. > > > > All of this is quite doable in userspace. It's not even _that_ racy - > > just do this before adding tasks etc. to the cgroup? > > Well, yes, it is technically doable from userspace, just like it was > technically doable prior to commit e39925734909 to have userspace > explicitly control the entire hierarchy of writeback settings. > However it _is_ pretty painful, and the flow you described would > essentially negate any benefits of that patch (it would require > userspace to, once again, manage the entire hierarchy explicitly > without any help from the kernel). I think it's a tad different. In the case of the commit e39925734909, the hierarchical behavior of zswap.writeback knob is quite semantically confusing, almost counter-intuitive (and does not conform to the convention of other cgroup knobs, which use the most restrictive limit - check out zswap.max limit checking for example). That commit arguably fixes it for the "common" case (i.e you want the hierarchical enforcement to hold for the most part). That's why there were even conversations about cc-ing the stable mailing list for backporting it to older kernels. This is more of a "new use case" patch. It complicates the API, for something readily doable in userspace - the kernel does not do anything that userspace cannot achieve. So it should undergo stricter scrutiny. :) Anyway, Yosry, Johannes, how do you feel about this?