On Thu, 25 Jan 2024, Matthew Wilcox wrote:
> On Thu, Jan 25, 2024 at 12:04:37PM -0800, David Rientjes wrote:
> > On Thu, 25 Jan 2024, Matthew Wilcox wrote:
> > > On Thu, Jan 25, 2024 at 10:26:19AM -0800, David Rientjes wrote:
> > > > There is a lot of excitement around upcoming CXL type 3 memory expansion
> > > > devices and their cost savings potential. As the industry starts to
> > > > adopt this technology, one of the key components in strategic planning is
> > > > how the upstream Linux kernel will support various tiered configurations
> > > > to meet various user needs. I think it goes without saying that this is
> > > > quite interesting to cloud providers as well as other hyperscalers :)
> > >
> > > I'm not excited. I'm disappointed that people are falling for this scam.
> > > CXL is the ATM of this decade. The protocol is not fit for the purpose
> > > of accessing remote memory, adding 10ns just for an encode/decode cycle.
> > > Hands up everybody who's excited about memory latency increasing by 17%.
> >
> > Right, I don't think that anybody is claiming that we can leverage locally
> > attached CXL memory as though it were DRAM on the same or remote socket,
> > or that there won't be a noticeable impact to application performance
> > while the memory is still across the device.
> >
> > It does offer several cost savings benefits for offloading of cold memory,
> > though, if locally attached, and I think the support for that use case is
> > inevitable -- in fact, Linux has some sophisticated support for the
> > locally attached use case already.
> >
> > > Then there are the lies from the vendors who want you to buy switches.
> > > Not one of them is willing to guarantee you the worst case latency
> > > through their switches.
> >
> > I should have prefaced this thread by saying "locally attached CXL memory
> > expansion", because that's the primary focus of many of the folks on this
> > email thread :)
>
> That's a huge relief.
> I was not looking forward to the patches to add
> support for pooling (etc).
>
> Using CXL as cold-data-storage makes a certain amount of sense, although
> I'm not really sure why it offers an advantage over NAND. It's faster
> than NAND, but you still want to bring it back locally before operating
> on it. NAND is denser, and consumes less power while idle. NAND comes
> with a DMA controller to move the data instead of relying on the CPU to
> move the data around. And of course moving the data first to CXL and
> then to swap means that it's got to go over the memory bus multiple
> times, unless you're building a swap device which attaches to the
> other end of the CXL bus ...

This is **exactly** the type of discussion we're looking to have :)

There are some things that I've chatted informally with folks about that
I'd like to bring to the forum:

 - Decoupling CPU migration from memory migration for NUMA Balancing (or
   perhaps deprecating CPU migration entirely)

 - Allowing NUMA Balancing to do migration as part of a kthread
   asynchronous to the NUMA hint fault, in kernel context

 - Abstraction for future hardware devices that can provide an expanded
   view into page hotness that can be leveraged in different areas of the
   kernel, including as a backend for NUMA Balancing to replace NUMA hint
   faults

 - Per-container support for configuring balancing and memory migration

 - Opting certain types of memory into NUMA Balancing (like tmpfs) while
   leaving other types alone

 - Utilizing hardware accelerated memory migration as a replacement for
   the traditional migrate_pages() path when available

I could go code all of this up and spend an enormous amount of time doing
so only to get NAKed by somebody because I'm ripping out their critical
use case that I just didn't know about :)

There's also the question of whether DAMON should be the source of truth
for this or whether it should be decoupled.
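To make the cold-memory-offload idea above concrete, here is a toy model of the policy decision being discussed: track per-page access recency (fed, in the real kernel, by NUMA hint faults or a hardware hotness source), and pick demotion candidates for the slower CXL-backed node once they have been idle past a threshold. This is an illustrative sketch only, not kernel code; all names (`Page`, `demotion_candidates`, the 60-second threshold) are made up for the example.

```python
from dataclasses import dataclass

@dataclass
class Page:
    pfn: int
    last_access: float  # timestamp of the most recent observed access

def record_access(page: Page, now: float) -> None:
    # In-kernel, a NUMA hint fault (or a hardware hotness counter)
    # would be the thing updating this.
    page.last_access = now

def demotion_candidates(pages, now, cold_after_s=60.0):
    # Pages untouched for cold_after_s seconds become candidates to
    # migrate (e.g. via the migrate_pages() path) to the CXL node.
    return [p for p in pages if now - p.last_access >= cold_after_s]

pages = [Page(pfn=1, last_access=0.0), Page(pfn=2, last_access=95.0)]
record_access(pages[1], now=100.0)          # page 2 is hot again
cold = demotion_candidates(pages, now=100.0)
print([p.pfn for p in cold])                # only the idle page qualifies
```

The interesting design questions from the list above map onto this sketch directly: who updates `last_access` (hint faults vs. a hardware backend), and who runs `demotion_candidates` (the faulting task vs. an asynchronous kthread).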
My dream world would be one where we could discuss the various use cases
for locally attached CXL memory and determine, as a group, what the
shared, comprehensive "Linux vision" for it is, and do so before
LSF/MM/BPF. In a perfect world, we could block out an expanded MM session
in Salt Lake City to bring all of these concepts together, work out which
approaches sound reasonable vs unreasonable, and leave that conference
with a clear understanding of what needs to happen.