On Wed, Mar 09, 2022 at 02:30:24PM -0800, David Rientjes wrote:
> On Mon, 7 Mar 2022, Johannes Weiner wrote:
> 
> > > IOW, this would be a /sys/devices/system/node/nodeN/reclaim mechanism for
> > > each NUMA node N on the system. (It would be similar to the existing
> > > per-node sysfs "compact" mechanism used to trigger compaction from
> > > userspace.)
> > 
> > I generally think a proactive reclaim interface is a good idea.
> > 
> > A per-cgroup control knob would make more sense to me, as cgroupfs
> > takes care of delegation, namespacing etc. and so would permit
> > self-directed proactive reclaim inside containers.
> 
> This is an interesting point and something that would need to be decided.
> There's pros and cons to both approaches, per-cgroup mechanism vs purely a
> per-node sysfs mechanism that can take a cgroup id.

I think we can just add both and avoid the cgroupid quirk.

We've done this many times: psi has global and cgroupfs interfaces,
so does vmstat, so does (did) swappiness etc. I don't see a problem
with adding a system and a cgroup interface for this.

> The reason we'd like this in sysfs is because of users who do not enable
> CONFIG_MEMCG but would still benefit from proactive reclaim. Such users
> do exist and do not rely on memcg, such as Chrome OS, and from my
> understanding this is normally done to speed up hibernation.

Yes, that makes sense.

> But I note your use of "per-cgroup" control knob and not specifically
> "per-memcg". Were you considering a proactive reclaim mechanism for a
> cgroup other than memcg? A new one?

No subtle nuance intended, I'm just using them interchangeably with
cgroup2. I meant: a cgroup that has the memory controller enabled :)

> I'm wondering if it would make sense for such a cgroup interface, if
> eventually needed, to be added incrementally on top of a per-node sysfs
> interface. (We know today that there is a need for proactive reclaim for
> users who do not use memcg at all.)

We've already had delegated deployments as well. Both uses are real.
But again, I don't think we have to choose at all. Let's add both!

> > > Userspace would write the following to this file:
> > > - nr_to_reclaim pages
> > 
> > This makes sense, although (and you hinted at this below), I'm
> > thinking it should be in bytes, especially if part of cgroupfs.
> 
> If we agree upon a sysfs interface I assume there would be no objection to
> this in nr_to_reclaim pages? I agree if this is to be a memcg knob that
> it should be expressed in bytes for consistency with other knobs.

Pages in general are somewhat fraught as a unit for facing userspace.
It requires people to use _SC_PAGESIZE, but they don't:

https://twitter.com/marcan42/status/1498710903675842563

Is there an argument *for* using pages?
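To make the units point concrete, here is a minimal userspace sketch
(not part of the RFC) of what every consumer of a pages-based knob
would have to get right; the sysfs path is the one proposed above and
does not exist today, and the proposed swappiness/flags fields are
left out:

#include <stdio.h>
#include <unistd.h>

/* Hypothetical path from the RFC; not an existing interface. */
#define NODE0_RECLAIM "/sys/devices/system/node/node0/reclaim"

/* Ask node 0 to proactively reclaim roughly 'bytes' of memory. */
static int reclaim_bytes(unsigned long long bytes)
{
	long page_size = sysconf(_SC_PAGESIZE); /* often 4096, not always */
	unsigned long long nr_pages;
	FILE *f;

	if (page_size <= 0)
		return -1;
	nr_pages = bytes / page_size;

	f = fopen(NODE0_RECLAIM, "w");
	if (!f)
		return -1;
	fprintf(f, "%llu\n", nr_pages);
	return fclose(f);
}

int main(void)
{
	/* Reclaim ~128M; the sysconf() step is the part a bytes
	 * interface would make unnecessary. */
	return reclaim_bytes(128ULL << 20) ? 1 : 0;
}

A bytes interface would drop the sysconf() step (and its failure
mode) entirely; the kernel already knows its own page size.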
> > > - swappiness factor
> > 
> > This I'm not sure about.
> > 
> > Mostly because I'm not sure about swappiness in general. It balances
> > between anon and file, but both of them are aged according to the same
> > LRU rules. The only reason to prefer one over the other seems to be
> > when the cost of reloading one (refault vs swapin) isn't the same as
> > the other. That's usually a hardware property, which in a perfect
> > world we'd auto-tune inside the kernel based on observed IO
> > performance. Not sure why you'd want this per reclaim request.
> 
> > > - flags to specify context, if any[**]
> > >
> > > [**] this is offered for extensibility to specify the context in which
> > > reclaim is being done (clean file pages only, demotion for memory
> > > tiering vs eviction, etc), otherwise 0
> > 
> > This one is curious. I don't understand the use cases for either of
> > these examples, and I can't think of other flags a user may pass on a
> > per-invocation basis. Would you care to elaborate some?
> 
> If we combine the above two concerns, maybe only a flags argument is
> sufficient where you can specify only anon or only file (and neither means
> both)? What is controllable by swappiness could be controlled by two
> different writes to the interface, one for (possibly) anon and one for
> (possibly) file.
> 
> There was discussion about treating the two different types of memory
> differently as a function of reload cost, cost of doing I/O for discard,
> and how much swap space we want proactive reclaim to take, as well as the
> only current alternative is to be playing with the global vm.swappiness.
> 
> Michal asked if this would include slab reclaim or shrinkers, I think the
> answer is "possibly yes," but no initial use case for this (flags would be
> extensible to permit the addition of it incrementally). In fact, if you
> were to pass a cgroup id of 0 to induce global proactive reclaim you could
> mimic the same control we have with vm.drop_caches today but does not
> include reclaiming all of a memory type.

Ok, I think I see.

My impression is that this is mechanism that the kernel's reclaim
algorithm should optimally provide, rather than (just)
application/setup dependent policy preferences.

The cost of reload, for example: yes, it needs to be balanced between
anon and file. But is there a target to aim for besides the lowest
aggregate paging overhead for the application?

How much swap space to use is a good point too, but we already have
an expression of intended per-cgroup share from the user:
memory.swap.high and memory.swap.max. Shouldn't reclaim in general
back off gradually from swap as utilization approaches 100%? Is
proactive reclaim different from conventional reclaim in this regard?

The write endurance question is similar. Policy would be to express a
global budget and per-cgroup shares of that budget; mechanism would be
to have this inform reclaim and writeback behavior. My question would
be why the mechanism *shouldn't* live in the kernel. And then allow
userspace to configure it in a way in which most people actually
understand: flash write budgets, swap space allowances etc.

The interface proposed here strikes me as rather low-level. It's less
a conventional user interface than it is building blocks for
implementing parts of the reclaim algorithm in userspace. I'm not
necessarily against that. It's just unusual and IMO deserves some more
discussion. I want to make sure that if there are shortcomings in the
kernel we address them rather than work around them.
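For the swap-space share in particular, here is a small sketch of the
declarative route cgroup2 already offers via memory.swap.high (and
memory.swap.max) mentioned above; the cgroup path is made up for the
example, and this only states the user's intent — whether reclaim
should additionally back off as that share fills up is the open
question above:

#include <stdio.h>

/*
 * Write a swap allowance for one cgroup via the existing cgroup2
 * memory.swap.high knob. The hierarchy path is just an example.
 */
static int set_swap_high(const char *cgroup, unsigned long long bytes)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/memory.swap.high", cgroup);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%llu\n", bytes);
	return fclose(f);
}

int main(void)
{
	/* Express an intended swap budget of ~1G for this workload. */
	return set_swap_high("/sys/fs/cgroup/workload", 1ULL << 30) ? 1 : 0;
}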