"Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxx> writes: > "Huang, Ying" <ying.huang@xxxxxxxxx> writes: > >> "Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxx> writes: >> >>> "Huang, Ying" <ying.huang@xxxxxxxxx> writes: >>> >>>> "Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxx> writes: >>>> >>>>> "Huang, Ying" <ying.huang@xxxxxxxxx> writes: >>>>> >>>>>> Hi, Jagdish, >>>>>> >>>>>> Jagdish Gediya <jvgediya@xxxxxxxxxxxxx> writes: >>>>>> >>>>> >>>>> ... >>>>> >>>>>>> e.g. with below NUMA topology, where node 0 & 1 are >>>>>>> cpu + dram nodes, node 2 & 3 are equally slower memory >>>>>>> only nodes, and node 4 is slowest memory only node, >>>>>>> >>>>>>> available: 5 nodes (0-4) >>>>>>> node 0 cpus: 0 1 >>>>>>> node 0 size: n MB >>>>>>> node 0 free: n MB >>>>>>> node 1 cpus: 2 3 >>>>>>> node 1 size: n MB >>>>>>> node 1 free: n MB >>>>>>> node 2 cpus: >>>>>>> node 2 size: n MB >>>>>>> node 2 free: n MB >>>>>>> node 3 cpus: >>>>>>> node 3 size: n MB >>>>>>> node 3 free: n MB >>>>>>> node 4 cpus: >>>>>>> node 4 size: n MB >>>>>>> node 4 free: n MB >>>>>>> node distances: >>>>>>> node 0 1 2 3 4 >>>>>>> 0: 10 20 40 40 80 >>>>>>> 1: 20 10 40 40 80 >>>>>>> 2: 40 40 10 40 80 >>>>>>> 3: 40 40 40 10 80 >>>>>>> 4: 80 80 80 80 10 >>>>>>> >>>>>>> The existing implementation gives below demotion targets, >>>>>>> >>>>>>> node demotion_target >>>>>>> 0 3, 2 >>>>>>> 1 4 >>>>>>> 2 X >>>>>>> 3 X >>>>>>> 4 X >>>>>>> >>>>>>> With this patch applied, below are the demotion targets, >>>>>>> >>>>>>> node demotion_target >>>>>>> 0 3, 2 >>>>>>> 1 3, 2 >>>>>>> 2 3 >>>>>>> 3 4 >>>>>>> 4 X >>>>>> >>>>>> For such machine, I think the perfect demotion order is, >>>>>> >>>>>> node demotion_target >>>>>> 0 2, 3 >>>>>> 1 2, 3 >>>>>> 2 4 >>>>>> 3 4 >>>>>> 4 X >>>>> >>>>> I guess the "equally slow nodes" is a confusing definition here. 
>>>>> Now if the system consists of 2 1GB equally slow memory and the
>>>>> firmware doesn't want to differentiate between them, firmware can
>>>>> present a single NUMA node with 2GB capacity? The fact that we are
>>>>> finding two NUMA nodes is a hint that there is some difference
>>>>> between these two memory devices. This is also captured by the fact
>>>>> that the distance between 2 and 3 is 40 and not 10.
>>>>
>>>> Do you have more information about this?
>>>
>>> Not sure I follow the question there. I was checking shouldn't firmware
>>> do a single NUMA node if two memory devices are of the same type? How will
>>> optane present such a config? Both the DIMMs will have the same
>>> proximity domain value and hence dax kmem will add them to the same NUMA
>>> node?
>>
>> Sorry for confusing.  I just wanted to check whether you have more
>> information about the machine configuration above.  The machines in my
>> hand have no complex NUMA topology as in the patch description.
>
>
> Even with simple topologies like below
>
> available: 3 nodes (0-2)
> node 0 cpus: 0 1
> node 0 size: 4046 MB
> node 0 free: 3478 MB
> node 1 cpus: 2 3
> node 1 size: 4090 MB
> node 1 free: 3430 MB
> node 2 cpus:
> node 2 size: 4074 MB
> node 2 free: 4037 MB
> node distances:
> node   0   1   2
>   0:  10  20  40
>   1:  20  10  40
>   2:  40  40  10
>
> With current code we get demotion targets assigned as below
>
> [ 0.337307] Demotion nodes for Node 0: 2
> [ 0.337351] Demotion nodes for Node 1:
> [ 0.337380] Demotion nodes for Node 2:
>
> I guess we should fix that to be below?
>
> [ 0.344554] Demotion nodes for Node 0: 2
> [ 0.344605] Demotion nodes for Node 1: 2
> [ 0.344638] Demotion nodes for Node 2:

If the cross-socket link has enough bandwidth to accommodate the PMEM
throughput, the new one is better.  If it hasn't, the old one may be
better.  So, I think we need some kind of user space overridden support
here.  Right?

> Most of the tests we are doing are using Qemu to simulate this.
> We started looking at this to avoid using demotion completely when slow
> memory is not present. i.e., we should have a different way to identify
> demotion targets other than node_states[N_MEMORY]. Virtualized platforms
> can have configs with memory only NUMA nodes with DRAM and we don't
> want to consider those as demotion targets.

Even if the demotion targets are set for some node, the demotion will
not work before enabling demotion via sysfs
(/sys/kernel/mm/numa/demotion_enabled).  So for a system without slow
memory, just don't enable demotion.

> While we are at it can you let us know how topology will look on a
> system with two optane DIMMs? Do both appear with the same
> target_node?

In my test system, multiple optane DIMMs in one socket will be
represented as one NUMA node.

I remember Baolin has a different configuration.

Hi, Baolin,  Can you provide some information about this?

>>
>>> If you are suggesting that firmware doesn't do that, then I agree with you
>>> that a demotion target like the below is good.
>>>
>>> node    demotion_target
>>>  0              2, 3
>>>  1              2, 3
>>>  2              4
>>>  3              4
>>>  4              X
>>>
>>> We can also achieve that with a simple change as below.
>>
>> Glad to see the demotion order can be implemented in a simple way.
>>
>> My concern is that is it necessary to do this?  If there are real
>> machines with the NUMA topology, then I think it's good to add the
>> support.  But if not, why do we make the code complex unnecessarily?
>>
>> I don't have these kind of machines, do you have and will have?
>>
>
>
> Based on the above, we still need to get the simpler fix merged right?

Or user overridden support?

Best Regards,
Huang, Ying

[snip]
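
To make the orderings discussed in this thread easier to compare, here is
a minimal Python sketch of one possible rule: each node demotes to the
nearest node(s) in the next-slower tier. This is only an illustration, not
the kernel code and not the patch under review. The function name
demotion_targets and the explicit tier mapping are assumptions made for
the example; the thread does not settle how tiers would be identified.

```python
def demotion_targets(distance, tier):
    """Map each node to its nearest node(s) in the next-slower tier.

    distance: square matrix, distance[a][b] is the NUMA distance a -> b.
    tier: dict node -> tier index (0 = fastest). Hypothetical input; how
          tiers are derived is outside the scope of this sketch.
    """
    targets = {}
    for node in sorted(tier):
        # Tiers strictly slower than this node's tier.
        slower = [t for t in set(tier.values()) if t > tier[node]]
        if not slower:
            targets[node] = []  # slowest tier: nothing to demote to
            continue
        next_tier = min(slower)
        candidates = [n for n in sorted(tier) if tier[n] == next_tier]
        best = min(distance[node][n] for n in candidates)
        # Keep every candidate that ties for the shortest distance.
        targets[node] = [n for n in candidates if distance[node][n] == best]
    return targets


# Five-node topology from the patch description.
FIVE_NODE_DISTANCE = [
    [10, 20, 40, 40, 80],
    [20, 10, 40, 40, 80],
    [40, 40, 10, 40, 80],
    [40, 40, 40, 10, 80],
    [80, 80, 80, 80, 10],
]
FIVE_NODE_TIER = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2}  # assumed tier assignment

# Three-node topology from the QEMU example.
THREE_NODE_DISTANCE = [
    [10, 20, 40],
    [20, 10, 40],
    [40, 40, 10],
]
THREE_NODE_TIER = {0: 0, 1: 0, 2: 1}  # assumed tier assignment

if __name__ == "__main__":
    print(demotion_targets(FIVE_NODE_DISTANCE, FIVE_NODE_TIER))
    print(demotion_targets(THREE_NODE_DISTANCE, THREE_NODE_TIER))
```

With the assumed tier assignments, the five-node case gives 0 -> 2, 3;
1 -> 2, 3; 2 -> 4; 3 -> 4, and the three-node case gives node 2 as the
target for both node 0 and node 1. The sketch ignores cross-socket
bandwidth, which is the reason given above for wanting a user-space
override.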