Commodity heterogeneous memory systems are emerging as CXL memory
devices approach general availability. Alongside this, a number of
proposals have been made to extend the mm/mempolicy component to better
manage these disparate resources.

Below is a collection of topics that have been discussed with some level
of seriousness on-list, at-conference, or in passing. Some have reference
patches, others are half-baked. Most bring up interesting implications of
heterogeneous memory systems and opportunities to improve performance
(or royally mess it up! It's a choose your own adventure!)

I'd like to propose a discussion of the environment, some of the issues
surrounding each problem, and a general call for new ideas - considering
that the heterogeneous memory world is still evolving.

Sub-Topics:

=== Weighted Interleave Mempolicy

This policy allows N:M interleave ratios between two or more nodes,
which allows better utilization of overall bandwidth. Right now the
continued discussion is about how to generate default values for this
system, and whether/how much topology information should be passed down
into mempolicy.

Prior versions of this patch set also included proposals for task-local
interleave weights, which would necessitate the introduction of
set_mempolicy2/mbind2. These were dropped until an explicit use-case
emerged.

https://lore.kernel.org/all/20240202170238.90004-1-gregory.price@xxxxxxxxxxxx/
https://lore.kernel.org/all/20240220202529.2365-2-gregory.price@xxxxxxxxxxxx/

=== process_mbind system call

The basic idea is to allow an external process the ability to modify the
mempolicy of a VMA owned by another task via a pidfd-esque syscall.

The current issue with this feature is the `current`-centric nature of
the mempolicy component and the difficulty of refactoring mempolicy to
make this possible. Progress on this feature has stumbled, but there is
still some quiet chittering about the idea in some corners.

https://lore.kernel.org/linux-mm/20221010094842.4123037-1-hezhongkun.hzk@xxxxxxxxxxxxx/
https://lore.kernel.org/all/ZV5zGROLefrsEcHJ@xxxxxxxxxxxxxxxxxx/

=== Allowing migrate-on-mbind to apply interleave on new node set

As of today, when mbind'ing a memory region with an interleave policy,
migrations are only made at a per-node granularity. For example, if you
mbind() a VMA which has already allocated 100% of its pages on node 1
and attempt to apply an interleave across nodes 1 and 2, no migrations
will occur. Migrations only occur if a node which was present previously
is no longer present in the new node set.

There has been some discussion (though no patches, yet) about whether it
is feasible to extend migrations to support "re-interleaving" a VMA
given the new mempolicy. Combined with something like process_mbind,
this could be used to rebalance memory - although it could be incredibly
disruptive to the performance of the software being rebound.

=== Cgroups (or cgroups-like?) extension for mempolicy

This comes up in just about every conversation I have been privy to
regarding mempolicy: "Why doesn't cgroups allow you to apply a mempolicy
which applies to all tasks in the cgroup?"

Typical answers:
 - What does that look like hierarchically?
 - Does inheritance even make sense here?
 - Is that a memcg thing or its own new cgroups component?
 - Cgroups is not for that.

The use case is somewhat straightforward: users wish to apply a
mempolicy to a set of tasks in a cgroup, and have migrated tasks inherit
the new cgroup's policy. Is there any way forward on a feature like
this, cgroups or otherwise? It does seem legitimately useful.

=== Flexibility on how policies apply to VMAs within a task

One of the frustrations with mempolicy is that you operate either at a
task-wide level or at a per-VMA level. This means that you either apply
a given policy to all VMAs, or software must become "numa-aware" enough
to apply individual policies to individual VMAs. Is something more
flexible possible?

For example: Is it possible to have a policy that applies only to
anonymous memory regions? Is it possible to explicitly avoid stack and
executable regions when deciding whether to invoke task-mempolicy logic?

These are big, non-trivial questions. Here is a single datum to help
explain the significance:

Stream Benchmark w/ weighted interleave (vs DRAM, 1 Socket + 1 CXL Device)
  Default interleave : -78%          (slower than DRAM)
  Global weighting   : -6% to +4%    (workload dependent)
  Targeted weights   : +2.5% to +4%  (consistently better than DRAM)

In the 'targeted weights' scenario, the benchmark was made numa-aware
enough to mbind its bandwidth-driving memory regions with a
weighted-interleave policy. However, we can see that when using the
task-wide mempolicy, the outcomes were more varied.

The question is whether it might be possible to make the targeted
scenario more accessible from the global interface by avoiding obvious
pitfalls (like placing code pages on far memory). Of course this can
only go so far. If a mutex in anonymous memory ends up on a node 3 hops
away... well, you're going to be a sad puppy. However, it's clear there
are at least some scenarios where this slightly more granular control
could be useful.
===

Are you interested in these topics? Have a topic to add? Think numa
should die in a fire? Fair game, fire away.

Kind Regards,
~Gregory Price