"Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxx> writes: > "Huang, Ying" <ying.huang@xxxxxxxxx> writes: > >> "Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxx> writes: >> >>> "Huang, Ying" <ying.huang@xxxxxxxxx> writes: >>> >>>> Hi, Jagdish, >>>> >>>> Jagdish Gediya <jvgediya@xxxxxxxxxxxxx> writes: >>>> >>> >>> ... >>> >>>>> e.g. with below NUMA topology, where node 0 & 1 are >>>>> cpu + dram nodes, node 2 & 3 are equally slower memory >>>>> only nodes, and node 4 is slowest memory only node, >>>>> >>>>> available: 5 nodes (0-4) >>>>> node 0 cpus: 0 1 >>>>> node 0 size: n MB >>>>> node 0 free: n MB >>>>> node 1 cpus: 2 3 >>>>> node 1 size: n MB >>>>> node 1 free: n MB >>>>> node 2 cpus: >>>>> node 2 size: n MB >>>>> node 2 free: n MB >>>>> node 3 cpus: >>>>> node 3 size: n MB >>>>> node 3 free: n MB >>>>> node 4 cpus: >>>>> node 4 size: n MB >>>>> node 4 free: n MB >>>>> node distances: >>>>> node 0 1 2 3 4 >>>>> 0: 10 20 40 40 80 >>>>> 1: 20 10 40 40 80 >>>>> 2: 40 40 10 40 80 >>>>> 3: 40 40 40 10 80 >>>>> 4: 80 80 80 80 10 >>>>> >>>>> The existing implementation gives below demotion targets, >>>>> >>>>> node demotion_target >>>>> 0 3, 2 >>>>> 1 4 >>>>> 2 X >>>>> 3 X >>>>> 4 X >>>>> >>>>> With this patch applied, below are the demotion targets, >>>>> >>>>> node demotion_target >>>>> 0 3, 2 >>>>> 1 3, 2 >>>>> 2 3 >>>>> 3 4 >>>>> 4 X >>>> >>>> For such machine, I think the perfect demotion order is, >>>> >>>> node demotion_target >>>> 0 2, 3 >>>> 1 2, 3 >>>> 2 4 >>>> 3 4 >>>> 4 X >>> >>> I guess the "equally slow nodes" is a confusing definition here. Now if the >>> system consists of 2 1GB equally slow memory and the firmware doesn't want to >>> differentiate between them, firmware can present a single NUMA node >>> with 2GB capacity? The fact that we are finding two NUMA nodes is a hint >>> that there is some difference between these two memory devices. This is >>> also captured by the fact that the distance between 2 and 3 is 40 and not 10. >> >> Do you have more information about this? > > Not sure I follow the question there. I was checking shouldn't firmware > do a single NUMA node if two memory devices are of the same type? How will > optane present such a config? Both the DIMMs will have the same > proximity domain value and hence dax kmem will add them to the same NUMA > node? Sorry for confusing. I just wanted to check whether you have more information about the machine configuration above. The machines in my hand have no complex NUMA topology as in the patch description. > If you are suggesting that firmware doesn't do that, then I agree with you > that a demotion target like the below is good. > > node demotion_target > 0 2, 3 > 1 2, 3 > 2 4 > 3 4 > 4 X > > We can also achieve that with a smiple change as below. Glad to see the demotion order can be implemented in a simple way. My concern is that is it necessary to do this? If there are real machines with the NUMA topology, then I think it's good to add the support. But if not, why do we make the code complex unnecessarily? I don't have these kind of machines, do you have and will have? > @@ -3120,7 +3120,7 @@ static void __set_migration_target_nodes(void) > { > nodemask_t next_pass = NODE_MASK_NONE; > nodemask_t this_pass = NODE_MASK_NONE; > - nodemask_t used_targets = NODE_MASK_NONE; > + nodemask_t this_pass_used_targets = NODE_MASK_NONE; > int node, best_distance; > > /* > @@ -3141,17 +3141,20 @@ static void __set_migration_target_nodes(void) > /* > * To avoid cycles in the migration "graph", ensure > * that migration sources are not future targets by > - * setting them in 'used_targets'. Do this only > + * setting them in 'this_pass_used_targets'. Do this only > * once per pass so that multiple source nodes can > * share a target node. > * > - * 'used_targets' will become unavailable in future > + * 'this_pass_used_targets' will become unavailable in future > * passes. This limits some opportunities for > * multiple source nodes to share a destination. > */ > - nodes_or(used_targets, used_targets, this_pass); > + nodes_or(this_pass_used_targets, this_pass_used_targets, this_pass); > > for_each_node_mask(node, this_pass) { > + > + nodemask_t used_targets = this_pass_used_targets; > + > best_distance = -1; > > /* Best Regards, Huang, Ying