On Wed, Nov 15, 2023 at 01:56:53PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@xxxxxxxxxxxx> writes:
>
> Because we usually have multiple nodes in one mem-tier, I still think
> mem-tier-based interface is simpler than node-based.  But, it seems more
> complex to introduce mem-tier into mempolicy.  Especially if we have
> per-task weights.  So, I am fine to go with node-based interface.
>
> > * cgroups: "this doesn't involve dynamic resource accounting /
> >            enforcement at all" and "these aren't resource
> >            allocations, it's unclear what the hierarchical
> >            relationship mean".
> >
> > * node: too global, explore smaller scope first then expand.
>
> Why is it too global?  I understand that it doesn't cover all possible
> use cases (although I don't know whether these use cases are practical
> or not).  But it can provide a reasonable default per-node weight based
> on available node performance information (such as, HMAT, CDAT, etc.).
> And, quite some workloads can just use it.  I think this is an useful
> feature.
>

Have been sharing notes with more folks.  Michal thinks a global set of
weights is unintuitive and not useful, and would prefer to see the
per-task weights first.

Though this may have been in response to adding it as an attribute of
nodes directly.

Another proposal here suggested adding a new sysfs setting
https://github.com/skhynix/linux/commit/61d2fcc7a880185df186fa2544edcd2f8785952a

  $ tree /sys/kernel/mm/interleave_weight/
  /sys/kernel/mm/interleave_weight/
  ├── enabled [1]
  ├── possible [2]
  └── node
      ├── node0
      │   └── interleave_weight [3]
      └── node1
          └── interleave_weight [3]

(this could be changed to /sys/kernel/mm/mempolicy/...)

I think the internal representation of this can be simplified greatly,
over what the patch provides now, but maybe this solves the "it doesn't
belong in these other components" issue.

Answer: Simply leave it as a static global kobject in mempolicy, which
also deals with many of the issues regarding race conditions.
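
For illustration, a minimal userspace C sketch of how a static global
weight table could be consulted at allocation time, with optional
per-task weights taking precedence.  This is not kernel code and is not
taken from the linked patch: MAX_NODES, struct task_policy and
effective_weight() are hypothetical names, and the table values are
example numbers only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 8 /* hypothetical; the kernel would size this by MAX_NUMNODES */

/* Stand-in for the global per-node table a sysfs interface might expose. */
static unsigned char global_weights[MAX_NODES] = { 3, 2, 1, 1, 2, 3, 1, 1 };

/* Hypothetical per-task policy state (not a real kernel structure). */
struct task_policy {
	unsigned long nodemask;                /* bit n set => node n allowed */
	bool has_task_weights;                 /* task supplied its own weights */
	unsigned char task_weights[MAX_NODES]; /* indexed by node id */
};

/*
 * Weight used for 'node' on the next allocation:
 *   - 0 if the node is outside the task's nodemask,
 *   - the task's own weight if it supplied weights,
 *   - otherwise the global weight,
 *   - with an unset (zero) weight treated as 1.
 */
static unsigned int effective_weight(const struct task_policy *pol,
				     unsigned int node)
{
	unsigned int w;

	assert(node < MAX_NODES);
	if (!(pol->nodemask & (1UL << node)))
		return 0;

	w = pol->has_task_weights ? pol->task_weights[node]
				  : global_weights[node];
	return w ? w : 1;
}
```

Because the table is read at allocation time, a change to the nodemask
needs no extra bookkeeping for tasks that rely on the global weights,
while task-supplied weights for nodes that were never set fall back
to 1.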

If a user provides weights, use those.  If they do not, use globals.

On a cpuset rebind event (container migration, mems_allowed changes),
manually set weights would have to remain, so in a bad case, the
weights would be very out of line with the real distribution of memory.

Example: if your nodemask is (0,1,2) and a migration changes it to
(3,4,5), then unfortunately your weights will likely revert to [1,1,1].

If set with global weights, they could automatically adjust.  It
would not be perfect, but it would be better than the potential worst
case above.  If that same migration occurs, the next allocation would
simply use whatever the target node weights are in the global config.

So if globally you have weights [3,2,1,1,2,3], and you move from
nodemask (0,1,2) to (3,4,5), your weights change from [3,2,1] to
[1,2,3].  If the structure is built as a matrix of (cpu_node,mem_nodes),
then you can also optimize based on the node the task is running on.

That feels very intuitive, deals with many race condition issues, and
the global setting can actually be implemented without the need for
set_mempolicy2 at all - which is certainly a bonus.

Would love more thoughts here.  Will have a new RFC with set_mempolicy2,
mbind2, and MPOL_WEIGHTED_INTERLEAVE soon that demonstrates the above.

Regards
~Gregory