Gregory Price <gregory.price@xxxxxxxxxxxx> writes:

> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
>
>> > This hopefully also explains why it's a global setting. The usecase is
>> > different from conventional NUMA interleaving, which is used as a
>> > locality measure: spread shared data evenly between compute
>> > nodes. This one isn't about locality - the CXL tier doesn't have local
>> > compute. Instead, the optimal spread is based on hardware parameters,
>> > which is a global property rather than a per-workload one.
>>
>> Well, I am not convinced about that TBH. Sure it is probably a good fit
>> for this specific CXL usecase but it just doesn't fit into many others I
>> can think of - e.g. proportional use of those tiers based on the
>> workload - you get what you pay for.
>>
>> Is there any specific reason for not having a new interleave interface
>> which defines weights for the nodemask? Is this because the policy
>> itself is very dynamic or is this more driven by simplicity of use?
>>
>
> I had originally implemented it this way while experimenting with new
> mempolicies.
>
> https://lore.kernel.org/linux-cxl/20231003002156.740595-5-gregory.price@xxxxxxxxxxxx/
>
> The downside of doing it in mempolicy is...
> 1) mempolicy is not sysfs friendly, and to make it sysfs friendly is a
>    non-trivial task. It is very "current-task" centric.
>
> 2) Barring a change to mempolicy to be sysfs friendly, the options for
>    implementing weights in the mempolicy are either a) new flag and
>    setting every weight individually in many syscalls, or b) a new
>    syscall (set_mempolicy2), which is what I demonstrated in the RFC.
>
> 3) mempolicy is also subject to cgroup nodemasks, and as a result you
>    end up with a rats nest of interactions between mempolicy nodemasks
>    changing as a result of cgroup migrations, nodes potentially coming
>    and going (hotplug under CXL), and others I'm probably forgetting.
>
>    Basically: If a node leaves the nodemask, should you retain the
>    weight, or should you reset it? If a new node comes into the node
>    mask... what weight should you set? I did not have answers to these
>    questions.
>
>
> It was recommended to explore placing it in tiers instead, so I took a
> crack at it here:
>
> https://lore.kernel.org/linux-mm/20231009204259.875232-1-gregory.price@xxxxxxxxxxxx/
>
> This had similar issue with the idea of hotplug nodes: if you give a
> tier a weight, and one or more of the nodes goes away/comes back... what
> should you do with the weight? Split it up among the remaining nodes?
> Rebalance? Etc.

The weight of a tier can be defined as the weight of one node of the
tier instead of the weight of all nodes of the tier. That is, for a
system as follows,

tier 0: node 0, node 1; weight=4
tier 1: node 2, node 3; weight=1

If you run workload with `numactl --weighted-interleave -n 0,2,3`, the
proportion will be: "4:0:1:1" on each node.

While for `numactl --weighted-interleave -n 0,2`, it will be: "4:0:1:0".

--
Best Regards,
Huang, Ying

> The result of this discussion lead us to simply say "What if we place
> the weights directly in the node". And that lead us to this RFC.
>
>
> I am not against implementing it in mempolicy (as proof: my first RFC).
> I am simply searching for the acceptable way to implement it.
>
> One of the benefits of having it set as a global setting is that weights
> can be automatically generated from HMAT/HMEM information (ACPI tables)
> and programs already using MPOL_INTERLEAVE will have a direct benefit.
>
> I have been considering whether MPOL_WEIGHTED_INTERLEAVE should be added
> along side this patch so that MPOL_INTERLEAVE is left entirely alone.
>
> Happy to discuss more,
> ~Gregory
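
For illustration only, the per-node reading of the tier weight in the
two numactl examples above can be modelled in a few lines of Python.
This is a rough sketch, not kernel code: the tables TIER_WEIGHT and
NODE_TIER and both helper functions are hypothetical names made up for
this example, and the strict "all pages of one node, then the next"
ordering is an assumption rather than anything the patches specify.

```python
from itertools import cycle, islice

# Hypothetical tables mirroring the example system above.
TIER_WEIGHT = {0: 4, 1: 1}             # tier id -> weight of ONE node in the tier
NODE_TIER = {0: 0, 1: 0, 2: 1, 3: 1}   # node id -> tier id


def interleave_proportions(nodemask):
    """Per-node weights for every known node; nodes outside the mask get 0."""
    return [
        TIER_WEIGHT[NODE_TIER[node]] if node in nodemask else 0
        for node in sorted(NODE_TIER)
    ]


def allocation_sequence(nodemask, pages):
    """Nodes chosen for `pages` allocations under weighted round-robin."""
    weights = interleave_proportions(nodemask)
    # One full round repeats each node as many times as its weight.
    one_round = [
        node
        for node, weight in zip(sorted(NODE_TIER), weights)
        for _ in range(weight)
    ]
    return list(islice(cycle(one_round), pages))


if __name__ == "__main__":
    print(interleave_proportions({0, 2, 3}))   # [4, 0, 1, 1]
    print(interleave_proportions({0, 2}))      # [4, 0, 1, 0]
    print(allocation_sequence({0, 2}, 10))
```

Under this reading, a node leaving or joining the nodemask does not
change the weight of any other node, so nothing needs to be split up or
rebalanced when a node is hotplugged.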