On Tue, Dec 13, 2022 at 7:58 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Tue, Dec 13, 2022 at 09:33:24AM +0100, Michal Hocko wrote:
> > I do recognize your need to control the demotion but I argue that it is
> > a bad idea to rely on an implicit behavior of the memory reclaim and an
> > interface which is _documented_ to primarily _reclaim_ memory.
>
> I think memory.reclaim should demote as part of page aging. What I'd
> like to avoid is *having* to manually control the aging component in
> the interface (e.g. making memory.reclaim *only* reclaim, and
> *requiring* a coordinated use of memory.demote to ensure progress.)
>
> > Really, consider that the current demotion implementation will change
> > in the future and, based on a newly added heuristic, memory reclaim or
> > compression would be preferred over migration to a different tier. This
> > might completely break your current assumptions and break your use case
> > which relies on an implicit demotion behavior. Do you see that as a
> > potential problem at all? What shall we do in that case? Special-case
> > memory.reclaim behavior?
>
> Shouldn't that be derived from the distance properties in the tier
> configuration?
>
> I.e. if local compression is faster than demoting to a slower node, we
> should maybe have a separate tier for that. Ignoring proactive reclaim
> or demotion commands for a second: on that node, global memory
> pressure should always compress first, while the oldest pages from the
> compression cache should demote to the other node(s) - until they
> eventually get swapped out.
>
> However fine-grained we make proactive reclaim control over these
> stages, it should at least be possible for the user to request the
> default behavior that global pressure follows, without jumping through
> hoops or requiring the coordinated use of multiple knobs. So IMO there
> is an argument for having a singular knob that requests comprehensive
> aging and reclaiming across the configured hierarchy.
> As far as explicit control over the individual stages goes - no idea
> if you would call the compression stage demotion or reclaim. The
> distinction still does not make much sense to me, since reclaim is
> just another form of demotion. Sure, page faults have a different
> access latency than dax to slower memory. But you could also have 3
> tiers of memory where the difference between tier 1 and 2 is much
> smaller than the difference between 2 and 3, and you might want to
> apply different demotion rates between them as well.
>
> The other argument is that demotion does not free cgroup memory,
> whereas reclaim does. But with multiple memory tiers of vastly
> different performance, isn't there also an argument for granting
> cgroups different shares of each memory? So that a higher-priority
> group has access to a bigger share of the fastest memory, and
> lower-priority cgroups are relegated to lower tiers. If we split
> those pools, then "demotion" will actually free memory in a cgroup.

I would also like to point out that I implemented something along those
lines in [1]. In that patch, pages demoted from inside the nodemask to
outside the nodemask count as 'reclaimed'. This is, in my mind, a very
generic solution to the 'should demoted pages count as reclaim?'
problem, and it works in all scenarios as long as the nodemask passed
to shrink_folio_list() is set correctly by the call stack.

> This is why I liked adding a nodes= argument to memory.reclaim the
> best. It doesn't encode a distinction that may not last for long.
>
> The problem comes from how to interpret the input argument and the
> return value, right? Could we solve this by requiring the passed
> nodes= to all be of the same memory tier? Then there is no confusion
> around what is requested and what the return value means.

I feel like I arrived at a better solution in [1]: pages demoted from
inside the nodemask to outside of it count as reclaimed, and the rest
don't.
But I think we could solve this with an explicit check that the nodes=
arg is from a single tier, yes.

> And if no nodes are passed, it means reclaim (from the lowest memory
> tier) X pages and demote as needed, then return the reclaimed pages.
>
> > Now to your specific use case. If there is a need to do memory
> > distribution balancing then fine, but this should be a well-defined
> > interface. E.g. is there a need to not only control demotion but
> > promotions as well? I haven't heard anybody requesting that so far,
> > but I can easily imagine that, like outsourcing memory reclaim to
> > userspace, someone might want to do the same thing with NUMA
> > balancing because $REASONS. Should that ever happen, I am pretty sure
> > hooking into memory.reclaim is not really a great idea.
>
> Should this ever happen, it would seem fair for that to be a separate
> knob anyway, no? One knob to move the pipeline in one direction
> (aging), one knob to move it the other way.

[1] https://lore.kernel.org/linux-mm/20221206023406.3182800-1-almasrymina@xxxxxxxxxx/