Hi everybody,

There is a lot of excitement around upcoming CXL type 3 memory expansion devices and their cost-savings potential. As the industry starts to adopt this technology, a key component of strategic planning is how the upstream Linux kernel will support the various tiered configurations needed to meet different user needs. It goes without saying that this is quite interesting to cloud providers as well as other hyperscalers :)

I think this discussion would benefit from a collaborative approach among the various stakeholders and interested parties, both because there are several different use cases that need different support models, and because there is a strong incentive to move "with" upstream Linux for this support rather than have multiple parties bring up their own stacks only to find that they are diverging from upstream rather than converging with it.

I'm interested to learn whether there is interest in forming a "Linux Memory Tiering Work Group" to share ideas, discuss multi-faceted approaches, and keep track of work items. Some recent discussions have shown that there is widespread interest in some very foundational topics for this technology, such as:

 - Decoupling CPU balancing from memory balancing (or obsoleting CPU
   balancing entirely)

   John Hubbard notes this would be useful for GPUs:

   a) GPUs have their own processors that are invisible to the kernel's
      NUMA "which tasks are active on which NUMA nodes" calculations,
      and

   b) similar to where CXL is generally going, we have already built
      fully memory-coherent hardware, which includes memory-only NUMA
      nodes.

 - An in-kernel hot memory abstraction, informed by hardware hinting
   drivers (including on architectures such as Power10), usable as a
   NUMA balancing backend for promotion and in other areas of the
   kernel such as transparent hugepage utilization (a strawman sketch
   of what such an interface could look like is appended at the end of
   this mail)

 - NUMA and memory tiering enlightenment for accelerators, such as for
   optimal use of GPU memory; extremely important for a cloud provider
   (hint hint :)

 - Asynchronous memory promotion independent of task_numa_fault(),
   while considering the cost of page migration (and of identifying
   cold memory to demote); a small sketch of today's userspace-driven
   alternative is also appended below

It looks like there is already some interest in such a working group, which would hold a biweekly discussion of shared interests with the goal of accelerating design, development, testing, and division of work:

  Alistair Popple
  Aneesh Kumar K V
  Brian Morris
  Christoph Lameter
  Dan Williams
  Gregory Price
  Grimm, Jon
  Huang, Ying
  Johannes Weiner
  John Hubbard
  Zi Yan

Specifically for the in-kernel hot memory abstraction topic, Google and Meta recently published an OCP base specification, "Hyperscale CXL Tiered Memory Expander Specification", available at https://drive.google.com/file/d/1fFfU7dFmCyl6V9-9qiakdWaDr9d38ewZ/view?usp=drive_link, that would be great to discuss. There is also ongoing work in the CXL Consortium to standardize some of these abstractions for CXL 3.1.

If you are interested in this topic and your name doesn't appear above (I already got you :), please:

 - reply-all to this email to express interest and expand upon the list
   of topics above with additional areas of interest that should be
   included, *or*

 - email me privately to express interest so that I can make sure you
   are included

Perhaps I'm overly optimistic, but one thing that would be absolutely *amazing* would be if we all had a very clear and understandable vision for how Linux will support this wide variety of use cases, even before that work is fully implemented (or even designed), in time for LSF/MM/BPF 2024 in May.

Thanks!
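
As promised above, a strawman for the in-kernel hot memory abstraction. To be very clear: nothing like this exists upstream and every name below is made up; it is only meant to seed discussion about what hinting drivers would provide and what promotion policy would consume:

    /* hotness.h: strictly a strawman; no such interface exists upstream */
    #include <stdint.h>

    /* One hint from a hardware hotness source (CXL device-side
     * tracking, Power10 hot/cold affinity counters, PMU samplers). */
    struct hotness_report {
            uint64_t pfn;       /* page frame the hint refers to */
            int      nid;       /* node the page currently lives on */
            uint32_t accesses;  /* access count in the last epoch */
    };

    /* What a hinting driver would register with the abstraction. */
    struct hotness_source_ops {
            /* Pull up to nr pending reports; returns number filled. */
            int (*read_reports)(struct hotness_report *reports, int nr);
            /* Restrict tracking to a physical range, if supported. */
            int (*set_tracking_region)(uint64_t base, uint64_t len);
    };

    /* Consumers (NUMA balancing promotion, khugepaged) would poll the
     * registered sources instead of relying on hinting faults. */
    int register_hotness_source(const struct hotness_source_ops *ops);

The point of the indirection is that promotion policy should not have to care whether the hints came from a CXL device, a CPU feature, or software sampling.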
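
And for contrast on the asynchronous promotion topic, here is roughly what promotion looks like today when userspace drives it by hand with the existing move_pages(2) syscall. The node numbers are assumptions for illustration only (node 0 as local DRAM, node 1 as a CPU-less CXL expander node); build with -lnuma:

    #include <numaif.h>     /* move_pages(2), MPOL_MF_MOVE */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            long page_size = sysconf(_SC_PAGESIZE);
            int target_node = 0;    /* assumed local DRAM node */
            int status = -1;
            void *page;

            /* Allocate one page and touch it so it is mapped. */
            page = aligned_alloc(page_size, page_size);
            if (!page)
                    return 1;
            memset(page, 0, page_size);

            /* Migrate ("promote") the page to the target node.  The
             * working group topic is having the kernel do this
             * asynchronously, informed by hotness hints, rather than
             * via task_numa_fault() or userspace loops like this. */
            if (move_pages(0 /* self */, 1, &page, &target_node,
                           &status, MPOL_MF_MOVE))
                    perror("move_pages");
            else
                    printf("page now on node %d\n", status);

            free(page);
            return 0;
    }

Everything a loop like this has to guess at (which pages are hot, what migration costs, when to run) is exactly what an in-kernel, hint-informed implementation could do better.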