Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave

Gregory Price <gregory.price@xxxxxxxxxxxx> · Tue, 31 Oct 2023 00:27:04 -0400

On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:

> > This hopefully also explains why it's a global setting. The usecase is
> > different from conventional NUMA interleaving, which is used as a
> > locality measure: spread shared data evenly between compute
> > nodes. This one isn't about locality - the CXL tier doesn't have local
> > compute. Instead, the optimal spread is based on hardware parameters,
> > which is a global property rather than a per-workload one.
> 
> Well, I am not convinced about that TBH. Sure it is probably a good fit
> for this specific CXL usecase but it just doesn't fit into many others I
> can think of - e.g. proportional use of those tiers based on the
> workload - you get what you pay for.
> 
> Is there any specific reason for not having a new interleave interface
> which defines weights for the nodemask? Is this because the policy
> itself is very dynamic or is this more driven by simplicity of use?
> 

I had originally implemented it this way while experimenting with new
mempolicies.

https://lore.kernel.org/linux-cxl/20231003002156.740595-5-gregory.price@xxxxxxxxxxxx/

The downsides of doing it in mempolicy are:
1) mempolicy is not sysfs friendly, and making it sysfs friendly is a
   non-trivial task.  It is very "current-task" centric.

2) Barring a change to mempolicy to be sysfs friendly, the options for
   implementing weights in the mempolicy are either a) a new flag and
   setting every weight individually across many syscalls, or b) a new
   syscall (set_mempolicy2), which is what I demonstrated in the RFC.

3) mempolicy is also subject to cgroup nodemasks, and as a result you
   end up with a rat's nest of interactions between mempolicy nodemasks
   changing as a result of cgroup migrations, nodes potentially coming
   and going (hotplug under CXL), and others I'm probably forgetting.

   Basically:  If a node leaves the nodemask, should you retain the
   weight, or should you reset it? If a new node comes into the node
   mask... what weight should you set? I did not have answers to these
   questions.

It was recommended to explore placing it in tiers instead, so I took a
crack at it here:

https://lore.kernel.org/linux-mm/20231009204259.875232-1-gregory.price@xxxxxxxxxxxx/

This had a similar issue with hotplug nodes: if you give a tier a
weight, and one or more of the nodes goes away/comes back... what
should you do with the weight?  Split it up among the remaining nodes?
Rebalance? Etc.

This discussion led us to simply ask, "What if we place the weights
directly in the node?"  And that led us to this RFC.
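
As a rough illustration of why per-node weights sidestep the nodemask
questions above, here is a small userspace model. It is only a sketch,
not the code in this series: node_weights, weighted_interleave() and
the weight values are hypothetical, and the fallback weight of 1 for an
unconfigured node is an assumption.

```python
# Hypothetical userspace model of weighted interleave; not kernel code.
# Weights live with the node (global), not with the task's nodemask.
node_weights = {0: 3, 1: 3, 2: 1}  # made-up example values


def weighted_interleave(nodemask, weights, count):
    """Return the node picked for each of `count` consecutive allocations."""
    nodes = sorted(nodemask)
    if not nodes:
        raise ValueError("empty nodemask")
    order = []
    while len(order) < count:
        for node in nodes:
            # Assumption: a missing or non-positive weight behaves like 1,
            # i.e. plain round-robin interleave for that node.
            weight = max(1, weights.get(node, 1))
            order.extend([node] * weight)
    return order[:count]
```

If node 1 drops out of the nodemask, weighted_interleave({0, 2},
node_weights, 8) yields 0,0,0,2,0,0,0,2. When it comes back, its weight
is still there, so nothing has to be retained, reset, or rebalanced by
the policy itself.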

I am not against implementing it in mempolicy (as proof: my first RFC).
I am simply searching for the acceptable way to implement it.

One of the benefits of having it set as a global setting is that weights
can be automatically generated from HMAT/HMEM information (ACPI tables)
and programs already using MPOL_INTERLEAVE will have a direct benefit.
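
One possible way to turn reported bandwidth into weights is sketched
below. This is purely an assumption for illustration: the series does
not specify this rule, weights_from_bandwidth() is a hypothetical
helper, and the bandwidth figures are made up rather than read from any
real HMAT.

```python
from functools import reduce
from math import gcd


def weights_from_bandwidth(bandwidth_mbps):
    """Reduce per-node bandwidth (integer MB/s) to small integer weights.

    Assumed rule: weights are proportional to bandwidth, divided by the
    greatest common divisor so the ratios stay exact but the numbers
    stay small.
    """
    if not bandwidth_mbps:
        return {}
    if any(bw <= 0 for bw in bandwidth_mbps.values()):
        raise ValueError("bandwidth must be a positive integer")
    divisor = reduce(gcd, bandwidth_mbps.values())
    return {node: bw // divisor for node, bw in bandwidth_mbps.items()}
```

For example, two DRAM nodes at 300000 MB/s and one CXL node at 100000
MB/s would reduce to weights 3, 3 and 1. Real bandwidth numbers rarely
share such a convenient divisor, so a practical version would likely
need some rounding or bucketing first.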

I have been considering whether MPOL_WEIGHTED_INTERLEAVE should be added
alongside this patch so that MPOL_INTERLEAVE is left entirely alone.

Happy to discuss more,
~Gregory