Re: [PATCH v6 00/12] mempolicy2, mbind2, and weighted interleave

Gregory Price <gregory.price@xxxxxxxxxxxx> · Wed, 10 Jan 2024 15:11:59 -0500

On Wed, Jan 10, 2024 at 02:11:39PM -0500, Andi Kleen wrote:
> > Weighted interleave is a new interleave policy intended to make use
> > of heterogeneous memory environments appearing with CXL.
> >
> > To implement weighted interleave with task-local weights, we need
> > new syscalls capable of passing a weight array. This is the
> > justification for mempolicy2/mbind2 - which are designed to be
> > extensible to capture future policies as well.
>
> I might be late to the party here, but it's not clear to me you really
> need the redesigned system calls. set_mempolicy has one argument left
> so it can be enhanced with a new pointer dependending on a bit in mode.
> For mbind() it already uses all arguments, but it has a flags argument.
>
> But it's unclear to me if a fully flexible weight array is really
> needed here anyways. Can some common combinations be encoded in flags instead?
> I assume it's mainly about some nodes getting preference depending on
> some attribute
>

(apologize for the re-send, I had a delivery failure on the list, want
to make sure it gets captured).

This is actually something I haven't written out in the RFC that I
probably should have: I'm also trying to make it so that a mempolicy
can be described in its entirety with a single syscall.

This cannot be done with the existing interfaces.
(see: the existence of set_mempolicy_home_node).

Likewise you cannot fetch the entire mempolicy configuration with a
single get_mempolicy() call.  Certainly if task-local weights exist,
there's no room left to add that on either.

You'd really like to know that the policy you set is not changed between
calls to multiple interfaces. Right now, if you want to twiddle bits in
the mempolicy (like home_node), the syscall does: (*new = *old)+change.

That's... not great.

So I did consider extending set_mempolicy() to allow you to twiddle the
weight of a given node, but I was considering another proposal in the
process: process_set_mempolicy and process_mbind.

These interfaces were proposed to allow mempolicy to be changed by an
external task. (This is presently not possible).

That's basically how we got to this current iteration.

Re: fully flexible weight array in the task

At the end of the day, this is really about bandwidth distribution. For
a reasonable set of workloads, the system-global settings should be
sufficient.  However, it was at least recommended that we also explore
task-local weights along the way - so here we are.

I'm certainly open to changing this, or even just dropping the task-local
weight requirement entirely, but I did want to consider the other issues
(above) in the process so we don't design ourselves into a corner if we
have to go there anyway.

> So if you add such a attribute, perhaps configurable in sysfs, and
> then have flags like give weight + 1 on attribute, give weight + 2 on
> attribute give weight + 4 on attribute. If more are needed there are more bits.
> That would be a much more compact and simpler interface.
>
> For set_mempolicy either add a flags argument or encode in mode.
>
> It would also shrink the whole patchkit dramatically and be less risky.

I'm certainly not against this idea.  It is less flexible, but it does make
it simpler.  Another limitation, however, is that you have to make multiple
syscalls if you want to change the weights of multiple nodes.

I wanted to avoid that kind of ambiguity.

If we don't think the external changing interfaces are realistic, then this
is all moot, I'm down for whatever design is feasible.

>
> You perhaps underestimate the cost and risk of designing complex
> kernel interfaces, it all has to be hardened audited fuzzed deployed etc.
>

Definitely not under any crazy impression that something like this is a
quick process.  Just iterating as I go and gathering feedback (I think
we're on version 4 or 5 of this idea, and v7 of this patch line :]).

I fully expect the initial chunk of (MPOL_WEIGHTED_INTERLEAVE + sysfs)
will be a merge candidate long before the task-local weights will be,
if only because, as you said, it's a much simpler extension and less risky.

I appreciate the feedback, happy to keep hacking away,
~Gregory