On Mon, Dec 12, 2022 at 10:32 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote: > > Mina Almasry <almasrymina@xxxxxxxxxx> writes: > > > On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko <mhocko@xxxxxxxx> wrote: > >> > >> On Fri 02-12-22 14:35:31, Mina Almasry wrote: > >> > The nodes= arg instructs the kernel to only scan the given nodes for > >> > proactive reclaim. For example use cases, consider a 2 tier memory system: > >> > > >> > nodes 0,1 -> top tier > >> > nodes 2,3 -> second tier > >> > > >> > $ echo "1m nodes=0" > memory.reclaim > >> > > >> > This instructs the kernel to attempt to reclaim 1m memory from node 0. > >> > Since node 0 is a top tier node, demotion will be attempted first. This > >> > is useful to direct proactive reclaim to specific nodes that are under > >> > pressure. > >> > > >> > $ echo "1m nodes=2,3" > memory.reclaim > >> > > >> > This instructs the kernel to attempt to reclaim 1m memory in the second tier, > >> > since this tier of memory has no demotion targets the memory will be > >> > reclaimed. > >> > > >> > $ echo "1m nodes=0,1" > memory.reclaim > >> > > >> > Instructs the kernel to reclaim memory from the top tier nodes, which can > >> > be desirable according to the userspace policy if there is pressure on > >> > the top tiers. Since these nodes have demotion targets, the kernel will > >> > attempt demotion first. > >> > > >> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > >> > reclaim""), the proactive reclaim interface memory.reclaim does both > >> > reclaim and demotion. Reclaim and demotion incur different latency costs > >> > to the jobs in the cgroup. Demoted memory would still be addressable > >> > by the userspace at a higher latency, but reclaimed memory would need to > >> > incur a pagefault. > >> > > >> > The 'nodes' arg is useful to allow the userspace to control demotion > >> > and reclaim independently according to its policy: if the memory.reclaim > >> > is called on a node with demotion targets, it will attempt demotion first; > >> > if it is called on a node without demotion targets, it will only attempt > >> > reclaim. > >> > > >> > Acked-by: Michal Hocko <mhocko@xxxxxxxx> > >> > Signed-off-by: Mina Almasry <almasrymina@xxxxxxxxxx> > >> > >> After discussion in [1] I have realized that I haven't really thought > >> through all the consequences of this patch and therefore I am retracting > >> my ack here. I am not nacking the patch at this statge but I also think > >> this shouldn't be merged now and we should really consider all the > >> consequences. > >> > >> Let me summarize my main concerns here as well. The proposed > >> implementation doesn't apply the provided nodemask to the whole reclaim > >> process. This means that demotion can happen outside of the mask so the > >> the user request cannot really control demotion targets and that limits > >> the interface should there be any need for a finer grained control in > >> the future (see an example in [2]). > >> Another problem is that this can limit future reclaim extensions because > >> of existing assumptions of the interface [3] - specify only top-tier > >> node to force the aging without actually reclaiming any charges and > >> (ab)use the interface only for aging on multi-tier system. A change to > >> the reclaim to not demote in some cases could break this usecase. > >> > > > > I think this is correct. My use case is to request from the kernel to > > do demotion without reclaim in the cgroup, and the reason for that is > > stated in the commit message: > > > > "Reclaim and demotion incur different latency costs to the jobs in the > > cgroup. Demoted memory would still be addressable by the userspace at > > a higher latency, but reclaimed memory would need to incur a > > pagefault." > > > > For jobs of some latency tiers, we would like to trigger proactive > > demotion (which incurs relatively low latency on the job), but not > > trigger proactive reclaim (which incurs a pagefault). I initially had > > proposed a separate interface for this, but Johannes directed me to > > this interface instead in [1]. In the same email Johannes also tells > > me that meta's reclaim stack relies on memory.reclaim triggering > > demotion, so it seems that I'm not the first to take a dependency on > > this. Additionally in [2] Johannes also says it would be great if in > > the long term reclaim policy and demotion policy do not diverge. > > > > [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@xxxxxxxxxxx/ > > [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@xxxxxxxxxxx/ > > After these discussion, I think the solution maybe use different > interfaces for "proactive demote" and "proactive reclaim". That is, > reconsider "memory.demote". In this way, we will always uncharge the > cgroup for "memory.reclaim". This avoid the possible confusion there. > And, because demotion is considered aging, we don't need to disable > demotion for "memory.reclaim", just don't count it. +1 on memory.demote. > Best Regards, > Huang, Ying >