Hi everybody,

There is a lot of excitement around upcoming CXL type 3 memory expansion devices and their cost-savings potential. As the industry starts to adopt this technology, a key component of strategic planning is how the upstream Linux kernel will support the various tiered configurations needed to meet different user needs. It goes without saying that this is quite interesting to cloud providers as well as other hyperscalers :)

I think this discussion would benefit from a collaborative approach among the various stakeholders and interested parties, both because there are several different use cases that need different support models, and because there is a strong incentive to move "with" upstream Linux for this support rather than have multiple parties bring up their own stacks only to find that they are diverging from upstream rather than converging with it.

I'm interested to learn whether there is interest in forming a "Linux Memory Tiering Work Group" to share ideas, discuss multi-faceted approaches, and keep track of work items. Some recent discussions have shown that there is widespread interest in some very foundational topics for this technology, such as:

 - Decoupling CPU balancing from memory balancing (or obsoleting CPU
   balancing entirely)

   John Hubbard notes this would be useful for GPUs:

   a) GPUs have their own processors that are invisible to the kernel's
      NUMA "which tasks are active on which NUMA nodes" calculations,
      and

   b) similar to where CXL is generally going, we have already built
      fully memory-coherent hardware, which includes memory-only NUMA
      nodes.

 - An in-kernel hot memory abstraction, informed by hardware hinting
   drivers (including on architectures such as Power10), usable as a
   NUMA balancing backend for promotion and in other areas of the
   kernel such as transparent hugepage utilization (a strawman sketch
   of what such an interface could look like is appended at the end of
   this mail)

 - NUMA and memory tiering enlightenment for accelerators, such as for
   optimal use of GPU memory; extremely important for a cloud provider
   (hint hint :)

 - Asynchronous memory promotion independent of task_numa_fault(),
   while considering the cost of page migration (and of identifying
   cold memory to demote); a small sketch of today's userspace-driven
   alternative is also appended below

It looks like there is already some interest in such a working group, which would hold a biweekly discussion of shared interests with the goal of accelerating design, development, testing, and division of work:

  Alistair Popple
  Aneesh Kumar K V
  Brian Morris
  Christoph Lameter
  Dan Williams
  Gregory Price
  Grimm, Jon
  Huang, Ying
  Johannes Weiner
  John Hubbard
  Zi Yan

Specifically for the in-kernel hot memory abstraction topic, Google and Meta recently published an OCP base specification, "Hyperscale CXL Tiered Memory Expander Specification", available at https://drive.google.com/file/d/1fFfU7dFmCyl6V9-9qiakdWaDr9d38ewZ/view?usp=drive_link, that would be great to discuss. There is also ongoing work in the CXL Consortium to standardize some of these abstractions for CXL 3.1.

If you are interested in this topic and your name doesn't appear above (I already got you :), please:

 - reply-all to this email to express interest and expand upon the list
   of topics above with additional areas of interest that should be
   included, *or*

 - email me privately to express interest so that I can make sure you
   are included

Perhaps I'm overly optimistic, but one thing that would be absolutely *amazing* would be if we all had a very clear and understandable vision for how Linux will support this wide variety of use cases, even before that work is fully implemented (or even designed), in time for LSF/MM/BPF 2024 in May.

Thanks!
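
As promised above, a strawman for the in-kernel hot memory abstraction. To be very clear: nothing like this exists upstream and every name below is made up; it is only meant to seed discussion about what hinting drivers would provide and what promotion policy would consume:

    /* hotness.h: strictly a strawman; no such interface exists upstream */
    #include <stdint.h>

    /* One hint from a hardware hotness source (CXL device-side
     * tracking, Power10 hot/cold affinity counters, PMU samplers). */
    struct hotness_report {
            uint64_t pfn;       /* page frame the hint refers to */
            int      nid;       /* node the page currently lives on */
            uint32_t accesses;  /* access count in the last epoch */
    };

    /* What a hinting driver would register with the abstraction. */
    struct hotness_source_ops {
            /* Pull up to nr pending reports; returns number filled. */
            int (*read_reports)(struct hotness_report *reports, int nr);
            /* Restrict tracking to a physical range, if supported. */
            int (*set_tracking_region)(uint64_t base, uint64_t len);
    };

    /* Consumers (NUMA balancing promotion, khugepaged) would poll the
     * registered sources instead of relying on hinting faults. */
    int register_hotness_source(const struct hotness_source_ops *ops);

The point of the indirection is that promotion policy should not have to care whether the hints came from a CXL device, a CPU feature, or software sampling.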
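
And for contrast on the asynchronous promotion topic, here is roughly what promotion looks like today when userspace drives it by hand with the existing move_pages(2) syscall. The node numbers are assumptions for illustration only (node 0 as local DRAM, node 1 as a CPU-less CXL expander node); build with -lnuma:

    #include <numaif.h>     /* move_pages(2), MPOL_MF_MOVE */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            long page_size = sysconf(_SC_PAGESIZE);
            int target_node = 0;    /* assumed local DRAM node */
            int status = -1;
            void *page;

            /* Allocate one page and touch it so it is mapped. */
            page = aligned_alloc(page_size, page_size);
            if (!page)
                    return 1;
            memset(page, 0, page_size);

            /* Migrate ("promote") the page to the target node.  The
             * working group topic is having the kernel do this
             * asynchronously, informed by hotness hints, rather than
             * via task_numa_fault() or userspace loops like this. */
            if (move_pages(0 /* self */, 1, &page, &target_node,
                           &status, MPOL_MF_MOVE))
                    perror("move_pages");
            else
                    printf("page now on node %d\n", status);

            free(page);
            return 0;
    }

Everything a loop like this has to guess at (which pages are hot, what migration costs, when to run) is exactly what an in-kernel, hint-informed implementation could do better.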