Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave

"Huang, Ying" <ying.huang@xxxxxxxxx> · Thu, 02 Nov 2023 10:01:20 +0800

Gregory Price <gregory.price@xxxxxxxxxxxx> writes:

> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
>
>> > This hopefully also explains why it's a global setting. The usecase is
>> > different from conventional NUMA interleaving, which is used as a
>> > locality measure: spread shared data evenly between compute
>> > nodes. This one isn't about locality - the CXL tier doesn't have local
>> > compute. Instead, the optimal spread is based on hardware parameters,
>> > which is a global property rather than a per-workload one.
>> 
>> Well, I am not convinced about that TBH. Sure it is probably a good fit
>> for this specific CXL usecase but it just doesn't fit into many others I
>> can think of - e.g. proportional use of those tiers based on the
>> workload - you get what you pay for.
>> 
>> Is there any specific reason for not having a new interleave interface
>> which defines weights for the nodemask? Is this because the policy
>> itself is very dynamic or is this more driven by simplicity of use?
>> 
>
> I had originally implemented it this way while experimenting with new
> mempolicies.
>
> https://lore.kernel.org/linux-cxl/20231003002156.740595-5-gregory.price@xxxxxxxxxxxx/
>
> The downside of doing it in mempolicy is...
> 1) mempolicy is not sysfs friendly, and to make it sysfs friendly is a
>    non-trivial task.  It is very "current-task" centric.
>
> 2) Barring a change to mempolicy to be sysfs friendly, the options for
>    implementing weights in the mempolicy are either a) new flag and
>    setting every weight individually in many syscalls, or b) a new
>    syscall (set_mempolicy2), which is what I demonstrated in the RFC.
>
> 3) mempolicy is also subject to cgroup nodemasks, and as a result you
>    end up with a rat's nest of interactions between mempolicy nodemasks
>    changing as a result of cgroup migrations, nodes potentially coming
>    and going (hotplug under CXL), and others I'm probably forgetting.
>
>    Basically:  If a node leaves the nodemask, should you retain the
>    weight, or should you reset it? If a new node comes into the node
>    mask... what weight should you set? I did not have answers to these
>    questions.
>
>
> It was recommended to explore placing it in tiers instead, so I took a
> crack at it here: 
>
> https://lore.kernel.org/linux-mm/20231009204259.875232-1-gregory.price@xxxxxxxxxxxx/
>
> This had a similar issue with the idea of hotplug nodes: if you give a
> tier a weight, and one or more of the nodes goes away/comes back... what
> should you do with the weight?  Split it up among the remaining nodes?
> Rebalance? Etc.

The weight of a tier can be defined as the weight of each node in the
tier, rather than a total shared among all nodes of the tier.  Then a
node going away or coming back does not change the weight of any other
node.  That is, for a system as follows,

tier 0: node 0, node 1; weight=4
tier 1: node 2, node 3; weight=1

If you run a workload with `numactl --weighted-interleave -n 0,2,3`, the
per-node proportions (node 0:1:2:3) will be "4:0:1:1".

For `numactl --weighted-interleave -n 0,2`, they will be "4:0:1:0".
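
To make the per-node interpretation concrete, here is a rough userspace
sketch (not kernel code; node_weights() and the tiers layout are made-up
names for illustration only).  Each node in the nodemask takes the weight
of its tier, and nodes outside the nodemask get 0:

```python
def node_weights(tiers, nodemask, nr_nodes):
    """Return per-node interleave weights for nodes 0..nr_nodes-1.

    tiers:    list of (nodes, weight) pairs, where weight applies to
              EACH node of the tier, not to the tier as a whole
    nodemask: set of nodes selected by the policy
    """
    weights = [0] * nr_nodes
    for nodes, weight in tiers:
        for node in nodes:
            # A node only contributes if the policy's nodemask selects it.
            if node in nodemask:
                weights[node] = weight
    return weights


# The example system above: tier 0 = {0, 1} weight 4, tier 1 = {2, 3} weight 1
tiers = [((0, 1), 4), ((2, 3), 1)]

print(":".join(map(str, node_weights(tiers, {0, 2, 3}, 4))))  # 4:0:1:1
print(":".join(map(str, node_weights(tiers, {0, 2}, 4))))     # 4:0:1:0
```

Dropping node 3 from the nodemask only zeroes node 3's entry; the
weights of nodes 0 and 2 are unchanged, so nothing has to be rebalanced.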

--
Best Regards,
Huang, Ying

> The result of this discussion led us to simply say "What if we place
> the weights directly in the node".  And that led us to this RFC.
>
>
> I am not against implementing it in mempolicy (as proof: my first RFC).
> I am simply searching for the acceptable way to implement it.
>
> One of the benefits of having it set as a global setting is that weights
> can be automatically generated from HMAT/HMEM information (ACPI tables)
> and programs already using MPOL_INTERLEAVE will have a direct benefit.
>
> I have been considering whether MPOL_WEIGHTED_INTERLEAVE should be added
> alongside this patch so that MPOL_INTERLEAVE is left entirely alone.
>
> Happy to discuss more,
> ~Gregory