On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
> > This hopefully also explains why it's a global setting. The usecase is
> > different from conventional NUMA interleaving, which is used as a
> > locality measure: spread shared data evenly between compute
> > nodes. This one isn't about locality - the CXL tier doesn't have local
> > compute. Instead, the optimal spread is based on hardware parameters,
> > which is a global property rather than a per-workload one.
>
> Well, I am not convinced about that TBH. Sure it is probably a good fit
> for this specific CXL usecase but it just doesn't fit into many others I
> can think of - e.g. proportional use of those tiers based on the
> workload - you get what you pay for.
>
> Is there any specific reason for not having a new interleave interface
> which defines weights for the nodemask? Is this because the policy
> itself is very dynamic or is this more driven by simplicity of use?
>

I had originally implemented it this way while experimenting with new
mempolicies.

https://lore.kernel.org/linux-cxl/20231003002156.740595-5-gregory.price@xxxxxxxxxxxx/

The downsides of doing it in mempolicy are:

1) mempolicy is not sysfs friendly, and making it sysfs friendly is a
   non-trivial task. It is very "current-task" centric.

2) Barring a change to mempolicy to make it sysfs friendly, the options
   for implementing weights in the mempolicy are either
   a) a new flag, with every weight set individually across many
      syscalls, or
   b) a new syscall (set_mempolicy2), which is what I demonstrated in
      the RFC.

3) mempolicy is also subject to cgroup nodemasks, and as a result you
   end up with a rat's nest of interactions between mempolicy nodemasks
   changing as a result of cgroup migrations, nodes potentially coming
   and going (hotplug under CXL), and others I'm probably forgetting.

   Basically: If a node leaves the nodemask, should you retain the
   weight, or should you reset it? If a new node comes into the
   nodemask... what weight should you set?

I did not have answers to these questions.

It was recommended to explore placing it in tiers instead, so I took a
crack at it here:

https://lore.kernel.org/linux-mm/20231009204259.875232-1-gregory.price@xxxxxxxxxxxx/

This had a similar issue with hotplug nodes: if you give a tier a
weight, and one or more of the nodes goes away/comes back... what should
you do with the weight? Split it up among the remaining nodes?
Rebalance? Etc.

That discussion led us to simply say "What if we place the weights
directly in the node". And that led us to this RFC.

I am not against implementing it in mempolicy (as proof: my first RFC).
I am simply searching for the acceptable way to implement it.

One of the benefits of having it as a global setting is that weights
can be automatically generated from HMAT/HMEM information (ACPI tables),
and programs already using MPOL_INTERLEAVE will benefit directly.

I have been considering whether MPOL_WEIGHTED_INTERLEAVE should be added
alongside this patch so that MPOL_INTERLEAVE is left entirely alone.

Happy to discuss more,
~Gregory
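
For illustration only, here is a minimal userspace sketch in C of how
per-node weights could drive an interleave decision. It is not kernel
code and is not taken from any of the linked RFCs. The names MAX_NODES,
node_weight, il_state and weighted_interleave_next are all hypothetical,
and the nodemask is simplified to a plain bitmask. The point it tries to
show is that when weights live in a global per-node table, a node
leaving the task's nodemask is simply skipped and its weight is neither
lost nor redistributed.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 8 /* hypothetical limit, for this sketch only */

/* Hypothetical global per-node weights (e.g. derived from bandwidth). */
static unsigned char node_weight[MAX_NODES] = {1, 1, 1, 1, 1, 1, 1, 1};

/* Per-task interleave cursor (assumed layout, not the kernel's). */
struct il_state {
	int cur_node;      /* node currently being allocated from */
	unsigned int left; /* allocations left on cur_node before advancing */
};

/* Return the next node after 'node' that is set in 'mask', wrapping. */
static int next_node_in_mask(int node, uint32_t mask)
{
	for (int i = 1; i <= MAX_NODES; i++) {
		int n = (node + i) % MAX_NODES;

		if (mask & (1u << n))
			return n;
	}
	return -1; /* unreachable when mask has a valid bit set */
}

/* Pick the node for the next allocation under weighted interleave. */
static int weighted_interleave_next(struct il_state *st, uint32_t mask)
{
	/* At least one node in [0, MAX_NODES) must be allowed. */
	assert((mask & ((1u << MAX_NODES) - 1)) != 0);

	/*
	 * Advance when the current node's share is used up, or when the
	 * current node has dropped out of the mask (cgroup change, hotplug).
	 */
	if (st->left == 0 || !(mask & (1u << st->cur_node))) {
		st->cur_node = next_node_in_mask(st->cur_node, mask);
		/* Treat a zero weight as 1 so every allowed node is used. */
		st->left = node_weight[st->cur_node] ?
			   node_weight[st->cur_node] : 1;
	}
	st->left--;
	return st->cur_node;
}
```

With node_weight[0] = 3 and node_weight[1] = 1, a cursor initialised
to { MAX_NODES - 1, 0 } and a mask of 0x3, successive calls return nodes
in a repeating 0, 0, 0, 1 pattern. If node 0 is dropped from the mask
partway through, the next call moves to node 1, while node_weight[0]
stays in the table for whenever node 0 returns.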