On Thu, Dec 8, 2022 at 3:54 AM Michal Hocko <mhocko@xxxxxxxx> wrote: > > On Thu 08-12-22 01:00:40, Mina Almasry wrote: > > On Thu, Dec 8, 2022 at 12:09 AM Michal Hocko <mhocko@xxxxxxxx> wrote: > > > > > > On Wed 07-12-22 13:43:55, Mina Almasry wrote: > > > > On Wed, Dec 7, 2022 at 3:12 AM Michal Hocko <mhocko@xxxxxxxx> wrote: > > > [...] > > > > > Anyway a proper nr_reclaimed tracking should be rather straightforward > > > > > but I do not expect to make a big difference in practice > > > > > > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > > > index 026199c047e0..1b7f2d8cb128 100644 > > > > > --- a/mm/vmscan.c > > > > > +++ b/mm/vmscan.c > > > > > @@ -1633,7 +1633,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > > > > LIST_HEAD(ret_folios); > > > > > LIST_HEAD(free_folios); > > > > > LIST_HEAD(demote_folios); > > > > > - unsigned int nr_reclaimed = 0; > > > > > + unsigned int nr_reclaimed = 0, nr_demoted = 0; > > > > > unsigned int pgactivate = 0; > > > > > bool do_demote_pass; > > > > > struct swap_iocb *plug = NULL; > > > > > @@ -2065,8 +2065,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > > > > > } > > > > > /* 'folio_list' is always empty here */ > > > > > > > > > > - /* Migrate folios selected for demotion */ > > > > > - nr_reclaimed += demote_folio_list(&demote_folios, pgdat); > > > > > + /* > > > > > + * Migrate folios selected for demotion. > > > > > + * Do not consider demoted pages to be reclaimed for the memcg reclaim > > > > > + * because no charges are really freed during the migration. Global > > > > > + * reclaim aims at releasing memory from nodes/zones so consider > > > > > + * demotion to reclaim memory. > > > > > + */ > > > > > + nr_demoted += demote_folio_list(&demote_folios, pgdat); > > > > > + if (!cgroup_reclaim(sc)) > > > > > + nr_reclaimed += nr_demoted; > > > > > + > > > > > /* Folios that could not be demoted are still in @demote_folios */ > > > > > if (!list_empty(&demote_folios)) { > > > > > /* Folios which weren't demoted go back on @folio_list for retry: */ > > > > > > > > > > [...] > > > > > > > > Thank you again, but this patch breaks the memory.reclaim nodes arg > > > > for me. This is my test case. I run it on a machine with 2 memory > > > > tiers. > > > > > > > > Memory tier 1= nodes 0-2 > > > > Memory tier 2= node 3 > > > > > > > > mkdir -p /sys/fs/cgroup/unified/test > > > > cd /sys/fs/cgroup/unified/test > > > > echo $$ > cgroup.procs > > > > head -c 500m /dev/random > /tmp/testfile > > > > echo $$ > /sys/fs/cgroup/unified/cgroup.procs > > > > echo "1m nodes=0-2" > memory.reclaim > > > > > > > > In my opinion the expected behavior is for the kernel to demote 1mb of > > > > memory from nodes 0-2 to node 3. > > > > > > > > Actual behavior on the tip of mm-unstable is as expected. > > > > > > > > Actual behavior with your patch cherry-picked to mm-unstable is that > > > > the kernel demotes all 500mb of memory from nodes 0-2 to node 3, and > > > > returns -EAGAIN to the user. This may be the correct behavior you're > > > > intending, but it completely breaks the use case I implemented the > > > > nodes= arg for and listed on the commit message of that change. > > > > > > Yes, strictly speaking the behavior is correct albeit unexpected. You > > > have told the kernel to _reclaim_ that much memory but demotion are > > > simply aging handling rather than a reclaim if the demotion target has a > > > lot of memory free. > > > > Yes, by the strict definition of reclaim, you're completely correct. > > But in reality earlier I proposed a patch to the kernel that disables > > demotion in proactive reclaim. That would have been a correct change > > by the strict definition of reclaim, but Johannes informed me that > > meta already has a dependency on proactive reclaim triggering demotion > > and directed me to add a nodes= arg to memory.reclaim to trigger > > demotion instead, to satisfy both use cases. Seems both us and meta > > are using this interface to trigger both reclaim and demotion, despite > > the strict definition of the word? > > Well, demotion is a part of aging and that is a part of the reclaim so I > believe we want both and demotion mostly an implementation detail. If > you want to have a very precise control then the nodemask should drive > you there. > > [...] > > > I am worried this will popping up again and again. I thought your nodes > > > subset approach could deal with this but I have overlooked one important > > > thing in your patch. The user provided nodemask controls where to > > > reclaim from but it doesn't constrain demotion targets. Is this > > > intentional? Would it actually make more sense to control demotion by > > > addint demotion nodes into the nodemask? > > > > > > > IMO, yes it is intentional, and no I would not recommend adding > > demotion nodes (I assume you mean adding both demote_from_nodes and > > demote_to_nodes as arg). > > What I really mean is to add demotion nodes to the nodemask along with > the set of nodes you want to reclaim from. To me that sounds like a > more natural interface allowing for all sorts of usecases: > - free up demotion targets (only specify demotion nodes in the mask) > - control where to demote (e.g. select specific demotion target(s)) > - do not demote at all (skip demotion nodes from the node mask) For clarification, do you mean to add another argument (e.g. demotion_nodes) in addition to the "nodes" argument? Also, which meaning do these arguments have? - Option 1: The "nodes" argument specifies the source nodes where pages should be reclaimed (but not demoted) from and the "demotion_nodes" argument specifies the source nodes where pages should be demoted (but not reclaimed) from. - Option 2: The "nodes" argument specifies the source nodes where pages should be reclaimed or demoted from and the "demotion_nodes" argument specifies the allowed demotion target nodes. Option 1 is more flexible in allowing various use cases. Option 2 can also support the use cases to reclaim only without demotion and demotion with fallback to reclaim, but cannot support demotion only without reclaim. > > My opinion is based on 2 reasons: > > > > 1. We control proactive reclaim by looking for nodes/tiers approaching > > pressure and triggering reclaim/demotion from said nodes/tiers. So we > > know the node/tier we would like to reclaim from, but not necessarily > > have a requirement on where the memory should go. I think it should be > > up to the kernel. > > 2. Currently I think most tiered machines will have 2 memory tiers, > > but I think the code is designed to support N memory tiers. What > > happens if N=3 and the user asks you to demote from the top tier nodes > > to the bottom tier nodes (skipping the middle tier)? The user in this > > case is explicitly asking to break the aging pipeline. From my short > > while on the mailing list I see great interest in respecting the aging > > pipeline, so it seems to me the demotion target should be decided by > > the kernel based on the aging pipeline, and the user should not be > > able to override it? I don't know. Maybe there is a valid use case for > > that somewhere. > > I agree that the agining should be preserved as much as possible unless > there is an explicit requirement to do otherwise which might be > something application specific. > > It is really hard to assume all the usecases at this stage but we should > keep in mind that the semantic of the interface will get cast into stone > once it is released. As of now I do see a great confusion point in the > nodemask semantic which pretends to allow some fine control while it is > largerly hard to predict because it makes some assumptions about the > reclaim while it has a very limited control of the demotion. > > -- > Michal Hocko > SUSE Labs