On Fri, Nov 10, 2023 at 05:05:50PM -1000, tj@xxxxxxxxxx wrote:
> Hello, Gregory.
>
> On Fri, Nov 10, 2023 at 05:29:25PM -0500, Gregory Price wrote:
> > Unfortunately mpol has no way of being changed from outside the task
> > itself once it's applied, other than changing its nodemasks via cpusets.
>
> Maybe it's time to add one?
>

I've been considering this as well, but there's more context being lost
here.  It's not just about being able to toggle the policy of a single
task, or a set of related tasks, but about supporting a more global data
interleaving strategy that makes use of bandwidth more effectively as
memory expansion and bandwidth expansion begin to occur on the PCIe/CXL
bus.

If the memory landscape of a system changes, for example due to a hotplug
event, you actually want to change the behavior of *every* task that is
using interleaving.  The fundamental bandwidth distribution of the entire
system has changed, so the behavior of every task using that memory should
change with it.

We've explored adding weights to mempolicy, memory tiers, nodes, memcg,
and now additionally cpusets.  In the last email I'd asked whether it
might actually be worth adding a new mempolicy component to cgroups to
aggregate these issues, rather than jamming them into either existing
component.  I would love your thoughts on that.

> > So one concrete use case: kubernetes might like to change cpusets or
> > move tasks from one cgroup to another, or a VM might be migrated from
> > one set of nodes to another (technically not mutually exclusive here).
> > Some memory policy settings (like weights) may no longer apply when
> > this happens, so it would be preferable to have a way to change them.
>
> Neither covers all use cases. As you noted in your mempolicy message, if
> the application wants finer grained control, the cgroup interface isn't
> great. In general, any changes which are dynamically initiated by the
> application itself aren't a great fit for cgroup.
>

It is certainly simple enough to add weights to mempolicy, but there are
limitations.  Mempolicy is extremely `current task` focused, and
significant refactor work would be needed to allow external tasks to
toggle a target task's mempolicy.  In particular, I worry about potential
concurrency issues, since mempolicy can be in the hot allocation path.
(Additionally, as you note below, you would have to hit every child thread
separately to make effective changes, since it is per-task.)

I'm not opposed to this, but it was suggested to me that maybe there is a
better place to put these weights.  Maybe the weights can be managed
mostly through RCU (rough sketch below), so perhaps the concern is
overblown.  Anyway...

It's certainly my intent to add weights to mempolicy, as that's where I
started.  If that is the preferred starting point from the perspective of
the mm community, I will revert to proposing set_mempolicy2 and/or fully
converting mempolicy into a sys/procfs-friendly mechanism.

The goal here is to enable mempolicy, or something like it, to acquire
additional flexibility in a heterogeneous memory world, considering how
threads may be migrated or checkpointed/restored, and how use cases like
bandwidth expansion may be insufficiently serviced by something as
fine-grained as per-task mempolicies.
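To make the RCU thought above a bit more concrete, here's a rough,
hypothetical sketch of a read-mostly global weight table (all structure,
variable, and function names here are made up purely for illustration;
this is not proposed code, just the pattern I have in mind):

#include <linux/errno.h>
#include <linux/mutex.h>
#include <linux/nodemask.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/string.h>

/* Hypothetical interleave weight table, replaced wholesale on update. */
struct il_weights {
	struct rcu_head rcu;
	unsigned char weight[MAX_NUMNODES];
};

static struct il_weights __rcu *global_il_weights;
static DEFINE_MUTEX(il_weights_lock);

/* Hot allocation path: read side is lockless under rcu_read_lock(). */
static unsigned char il_weight_of(int nid)
{
	struct il_weights *w;
	unsigned char val = 1;	/* default: plain round-robin weight */

	rcu_read_lock();
	w = rcu_dereference(global_il_weights);
	if (w)
		val = w->weight[nid];
	rcu_read_unlock();

	return val;
}

/* Slow path (sysfs/cgroup/syscall writer): publish a new table. */
static int il_weights_update(const unsigned char *new_weights)
{
	struct il_weights *new, *old;

	new = kmalloc(sizeof(*new), GFP_KERNEL);
	if (!new)
		return -ENOMEM;
	memcpy(new->weight, new_weights, sizeof(new->weight));

	mutex_lock(&il_weights_lock);
	old = rcu_dereference_protected(global_il_weights,
					lockdep_is_held(&il_weights_lock));
	rcu_assign_pointer(global_il_weights, new);
	mutex_unlock(&il_weights_lock);

	if (old)
		kfree_rcu(old, rcu);
	return 0;
}

The point being that readers in the allocation path never take a lock or
block, and the writer's cost is one pointer publish plus a deferred free,
which is why I suspect the hot-path concern is manageable regardless of
whether the weights end up in mempolicy, cgroups, or somewhere else.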
> I'm generally pretty wary of adding non-resource group configuration
> interfaces, especially when they don't have a counterpart in the regular
> per-process/thread API, for a few reasons:
>
> 1. The reason why people try to add those through cgroup sometimes is
>    because it seems easier to add those new features through cgroup,
>    which may be true to some degree, but shortcuts often aren't very
>    conducive to long term maintainability.
>

Concur.  That's why I originally proposed the mempolicy extension, since I
wasn't convinced by global settings, but I've been brought around by the
fact that migrations and hotplug events may want to effect mass changes
across a large number of unrelated tasks.

> 2. As noted above, just having cgroup often excludes a significant
>    portion of use cases. Not all systems enable cgroups, and programmatic
>    accesses from target processes / threads are coarse-grained and can be
>    really awkward.
>
> 3. Cgroup can be convenient when a group config change is necessary.
>    However, we really don't want to keep adding kernel interfaces just
>    for changing configs for a group of threads. For config changes which
>    aren't high frequency, userspace iterating the member processes and
>    applying the changes if possible is usually good enough, which usually
>    involves looping until no new process is found. If the looping is
>    problematic, the cgroup freezer can be used to atomically stop all
>    member threads to provide atomicity too.
>

If I can ask, do you think it would be out of line to propose a major
refactor of mempolicy that allows external tasks to change a running
task's mempolicy, *as well as* a cgroup-wide mempolicy component?  As
you've alluded to here, I don't think either mechanism on its own is
sufficient to handle all use cases, but the two combined do seem
sufficient.

I do appreciate the feedback here, thank you.  I think we are getting to
the bottom of how/where such new mempolicy mechanisms should be
implemented.  (A rough sketch of the freezer-based iteration you describe
in (3) follows in a P.S. below.)

Gregory
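P.S. Just to check my understanding of the iteration you describe in (3),
a minimal userspace sketch of the freezer-based variant (cgroup v2 paths;
the group path and apply_config() are placeholders, and with the freezer a
single pass over cgroup.procs should suffice since members can't fork
while frozen):

/*
 * Hypothetical sketch only: freeze a cgroup (v2), walk cgroup.procs,
 * apply some per-task change, then thaw.  The cgroup path and
 * apply_config() are stand-ins for illustration.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

#define CGROUP "/sys/fs/cgroup/mygroup"		/* hypothetical group */

static int write_freeze(const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/cgroup.freeze", CGROUP);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

/* Stand-in for whatever per-task change is being applied. */
static void apply_config(pid_t pid)
{
	printf("would update pid %ld\n", (long)pid);
}

int main(void)
{
	char path[256], line[64];
	FILE *f;

	if (write_freeze("1"))			/* stop all member threads */
		return 1;

	snprintf(path, sizeof(path), "%s/cgroup.procs", CGROUP);
	f = fopen(path, "r");
	if (f) {
		while (fgets(line, sizeof(line), f))
			apply_config((pid_t)atol(line));
		fclose(f);
	}

	write_freeze("0");			/* thaw */
	return 0;
}

If that's roughly the intended pattern, then the missing piece for our use
case is only the per-task interface to apply the change with, which loops
back to the question above.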