On Wed 01-11-23 12:58:55, Gregory Price wrote:
> On Wed, Nov 01, 2023 at 02:45:50PM +0100, Michal Hocko wrote:
> > On Tue 31-10-23 00:27:04, Gregory Price wrote:
> [... snip ...]
> > > 
> > > The downside of doing it in mempolicy is...
> > > 1) mempolicy is not sysfs friendly, and to make it sysfs friendly is a
> > >    non-trivial task. It is very "current-task" centric.
> > 
> > True. Cpusets is the way to make it less process centric but that comes
> > with its own constraints (namely which NUMA policies are supported).
> > 
> > > 2) Barring a change to mempolicy to be sysfs friendly, the options for
> > >    implementing weights in the mempolicy are either a) a new flag and
> > >    setting every weight individually in many syscalls, or b) a new
> > >    syscall (set_mempolicy2), which is what I demonstrated in the RFC.
> > 
> > Yes, that would likely require a new syscall.
> > 
> > > 3) mempolicy is also subject to cgroup nodemasks, and as a result you
> > >    end up with a rat's nest of interactions between mempolicy nodemasks
> > >    changing as a result of cgroup migrations, nodes potentially coming
> > >    and going (hotplug under CXL), and others I'm probably forgetting.
> > 
> > Is this really any different from what you are proposing though?
> > 
> 
> In only one manner: an external user can set the weight of a node that
> is added later on. If it is implemented in mempolicy, then this is not
> possible.
> 
> Basically consider: `numactl --interleave=all ...`
> 
> If `--weights=...`: when a node hotplug event occurs, there is no
> recourse for adding a weight for the new node (it will default to 1).

Correct, and this is what I was asking about in an earlier email. How
much do we really need to consider this setup? Is this something that
is nice to have, or does the nature of the technology require it to be
fully dynamic and to expect new nodes coming up at any moment?

> Maybe the answer is "Best effort, sorry" and we don't handle that
> situation. That doesn't seem entirely unreasonable.
> 
> At least with weights in the node (or cgroup, or memtier, whatever) it
> provides the ability to set that weight outside the mempolicy context.
> 
> > > weight, or should you reset it? If a new node comes into the node
> > > mask... what weight should you set? I did not have answers to these
> > > questions.
> > 
> > I am not really sure I follow you. Are you talking about cpuset
> > nodemask changes or memory hotplug here?
> 
> Actually both - slightly different context.
> 
> If the weights are implemented in mempolicy and the cpuset nodemask
> changes, then the mempolicy nodemask changes with it.
> 
> If the node is removed from the system, I believe (need to validate
> this, but IIRC) the node will be removed from any registered cpusets.
> As a result, that falls down to mempolicy, and the node is removed.

I do not think we do anything like that. Userspace might decide to
change the numa mask when a node is offlined, but I do not think we do
anything like that automagically.

> Not entirely sure what happens if a node is added. The only case where
> I think that is relevant is when the cpuset is empty ("all") and the
> mempolicy is set to something like `--interleave=all`. In this case,
> it's possible that the new node will simply get a default weight (1),
> and if weights are implemented in mempolicy only, there is no recourse
> for changing it.

That is what I would expect.

[...]

> > Right. This is understood. My main concern is whether this outweighs
> > the limitations of having a _global_ policy _only_. Historically a
> > single global policy usually led to finding ways to make it more
> > scoped (usually through cgroups).
> > 
> 
> Maybe the answer here is to put it in cgroups + mempolicy, and not
> handle hotplug? This is an easy shift of this patch to cgroups, and
> then pulling my syscall patch forward to add weights directly to
> mempolicy.

Moving the global policy to cgroups would make the main concern of
different workloads looking for different policies less problematic. I
didn't have much time to think that through, but the main question is
how to sanely define hierarchical properties of those weights. This is
more about resource distribution than enforcement, so maybe a simple
inherit or overwrite (if you have more specific needs) semantic makes
sense and is sufficient.

> I think the interleave code stays pretty much the same; the only
> difference would be where the task gets the weight from:
> 
>     if (policy->mode == WEIGHTED_INTERLEAVE)
>         weight = pol->weight[target_node]
>     else
>         cgroups.get_weight(from_node, target_node)
> 
> ~Gregory

This is not as much about the code as it is about the proper interface,
because that will get cast in stone once introduced. It would be really
bad to realize that we have a global policy that doesn't fit well and
then have a hard time working around it without breaking anybody.
-- 
Michal Hocko
SUSE Labs
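
For reference, here is a minimal, self-contained userspace sketch of the
weighted interleave behaviour discussed above. It is only an
illustration: the node count, the weights, and every identifier in it
are made up for the example and do not correspond to any existing kernel
or numactl interface. Each node receives weight[n] consecutive
allocations before the rotor advances, so a newly hot-added node whose
weight was never configured would simply run with the default of 1,
which is exactly the situation Gregory describes.

#include <stdio.h>

#define MAX_NODES 4

/*
 * Hypothetical per-node weights, e.g. 3 pages on node 0 (DRAM) for
 * every 1 page on node 2 (CXL).  A node that appears later without a
 * configured weight would default to 1, as discussed in the thread.
 */
static unsigned int node_weight[MAX_NODES] = { 3, 2, 1, 1 };

/*
 * Pick the node for the next allocation with a simple weighted rotor:
 * keep returning the current node until its weight is used up, then
 * move on to the next node in the mask.
 */
static int next_interleave_node(void)
{
	static int cur_node;
	static unsigned int used;
	int node = cur_node;

	if (++used >= node_weight[cur_node]) {
		used = 0;
		cur_node = (cur_node + 1) % MAX_NODES;
	}
	return node;
}

int main(void)
{
	unsigned int pages_per_node[MAX_NODES] = { 0 };

	/* Simulate placing 7000 pages and show how they spread out. */
	for (int i = 0; i < 7000; i++)
		pages_per_node[next_interleave_node()]++;

	for (int n = 0; n < MAX_NODES; n++)
		printf("node %d: weight %u -> %u pages\n",
		       n, node_weight[n], pages_per_node[n]);
	return 0;
}

With weights 3:2:1:1 this prints 3000, 2000, 1000 and 1000 pages
respectively. The placement loop is independent of where the weights are
stored: whether they come from a new syscall, a cgroup file, or a
node-level sysfs attribute only changes how node_weight[] gets filled
in, which is the point Gregory makes about the interleave code staying
mostly the same.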