Commodity heterogeneous memory systems are emerging as CXL memory
devices approach general availability. Alongside this, a number of
proposals have been made to extend the mm/mempolicy component to better
manage these disparate resources.

Below is a collection of topics that have been discussed with some level
of seriousness on-list, at-conference, or in passing. Some have reference
patches, others are half-baked. Most bring up interesting implications of
heterogeneous memory systems and opportunities to improve performance
(or royally mess it up! It's a choose your own adventure!)

I'd like to propose a discussion of the environment, some of the issues
surrounding each problem, and a general call for new ideas - considering
that the heterogeneous memory world is still evolving.

Sub-Topics:

=== Weighted Interleave Mempolicy

This policy allows N:M interleave ratios between two or more nodes,
which allows better utilization of overall bandwidth. Right now the
continued discussion is about how to generate default values for this
system, and whether/how much topology information should be passed down
into mempolicy.

Prior versions of this patch set also included proposals for task-local
interleave weights, which would necessitate the introduction of
set_mempolicy2/mbind2. These were dropped until an explicit use-case
emerged.

https://lore.kernel.org/all/20240202170238.90004-1-gregory.price@xxxxxxxxxxxx/
https://lore.kernel.org/all/20240220202529.2365-2-gregory.price@xxxxxxxxxxxx/

=== process_mbind system call

The basic idea is to allow an external process the ability to modify the
mempolicy of a VMA owned by another task via a pidfd-esque syscall.

The current issue with this feature is the `current`-centric nature of
the mempolicy component and the difficulty of refactoring mempolicy to
make this possible. Progress on this feature has stumbled, but there is
still some quiet chittering about the idea in some corners.

https://lore.kernel.org/linux-mm/20221010094842.4123037-1-hezhongkun.hzk@xxxxxxxxxxxxx/
https://lore.kernel.org/all/ZV5zGROLefrsEcHJ@xxxxxxxxxxxxxxxxxx/

=== Allowing migrate-on-mbind to apply interleave on new node set

As of today, when mbind'ing a memory region with an interleave policy,
migrations are only made at a per-node granularity. For example, if you
mbind() a VMA which has already allocated 100% of its pages on node 1
and attempt to apply an interleave across nodes 1 and 2, no migrations
will occur. Migrations only occur if a node which was present previously
is no longer present in the new node set.

There has been some discussion (though no patches, yet) about whether it
is feasible to extend migrations to support "re-interleaving" a VMA
given the new mempolicy. Combined with something like process_mbind,
this could be used to rebalance memory - although it could be incredibly
disruptive to the performance of the software being rebound.

=== Cgroups (or cgroups-like?) extension for mempolicy

This comes up in just about every conversation I have been privy to
regarding mempolicy: "Why doesn't cgroups allow you to apply a mempolicy
which applies to all tasks in the cgroup?"

Typical answers:
 - What does that look like hierarchically?
 - Does inheritance even make sense here?
 - Is that a memcg thing or its own new cgroups component?
 - Cgroups is not for that.

The use case is somewhat straightforward: users wish to apply a
mempolicy to a set of tasks in a cgroup, and have migrated tasks inherit
the new cgroup's policy. Is there any way forward on a feature like
this, cgroups or otherwise? It does seem legitimately useful.

=== Flexibility on how policies apply to VMAs within a task

One of the frustrations with mempolicy is that you operate either at a
task-wide level or at a per-VMA level. This means that you either apply
a given policy to all VMAs, or software must become "numa-aware" enough
to apply individual policies to individual VMAs. Is something more
flexible possible?

For example: Is it possible to have a policy that applies only to
anonymous memory regions? Is it possible to explicitly avoid stack and
executable regions when deciding whether to invoke task-mempolicy logic?

These are big, non-trivial questions. Here is a single datum to help
explain the significance:

Stream Benchmark w/ weighted interleave (vs DRAM, 1 Socket + 1 CXL Device)
  Default interleave : -78%          (slower than DRAM)
  Global weighting   : -6% to +4%    (workload dependent)
  Targeted weights   : +2.5% to +4%  (consistently better than DRAM)

In the 'targeted weights' scenario, the benchmark was made numa-aware
enough to mbind its bandwidth-driving memory regions with a
weighted-interleave policy. However, we can see that when using the
task-wide mempolicy, the outcomes were more varied.

The question is whether it might be possible to make the targeted
scenario more accessible from the global interface by avoiding obvious
pitfalls (like placing code pages on far memory). Of course this can
only go so far. If a mutex in anonymous memory ends up on a node 3 hops
away... well, you're going to be a sad puppy. However, it's clear there
are at least some scenarios where this slightly more granular control
could be useful.
===

Are you interested in these topics? Have a topic to add? Think numa
should die in a fire? Fair game, fire away.

Kind Regards,
~Gregory Price