Re: [LSF/MM/BPF TOPIC] Locally attached memory tiering

David Rientjes <rientjes@xxxxxxxxxx> · Thu, 9 May 2024 20:10:07 -0700 (PDT)

On Wed, 8 May 2024, Huang, Ying wrote:

> > Hi all,
> >
> > I think it would be very worthwhile to have a block set aside for 
> > discussion on locally attached memory tiering extensions at LSF/MM/BPF 
> > 2024.
> >
> > Primarily interested in discussing Linux enlightenment for CXL 1.1 and 
> > later type-3 memory expansion devices (CXL.mem).  I think we could touch 
> > on CXL 2.0 and later memory pooling architectures if we have time and 
> > there is interest, but the primary focus here would be locally attached.
> >
> > Based on the premise for a Memory Tiering Working Group[1], there is 
> > widespread interest in the foundational topics for generally useful Linux 
> > enlightenment:
> >
> >  - Decoupling CPU balancing from memory balancing (or obsoleting CPU
> >    balancing entirely)
> >
> >    + John Hubbard notes this would be useful for GPUs:
> >
> >       a) GPUs have their own processors that are invisible to the kernel's
> >          NUMA "which tasks are active on which NUMA nodes" calculations,
> >          and
> >
> >       b) Similar to where CXL is generally going, we have already built
> >          fully memory-coherent hardware, which includes memory-only NUMA
> >          nodes.
> >
> >  - In-kernel hot memory abstraction, informed by hardware hinting drivers
> >    (incl some architectures like Power10), usable as a NUMA Balancing
> >    backend for promotion and other areas of the kernel like transparent
> >    hugepage utilization
> >
> >  - NUMA and memory tiering enlightenment for accelerators, such as for
> >    optimal use of GPU memory, extremely important for a cloud provider
> >    (hint hint :)
> >
> >  - Asynchronous memory promotion independent of task_numa_fault() while
> >    considering the cost of page migration (due to identifying cold memory)
> >
> >  - What role userspace plays in this decision-making and how we can
> >    extend the default policy and mechanisms in the kernel to allow for it
> >    if necessary
> >
> > Additional topics that you find interesting are also very helpful!
> 
> In addition to the hot memory identification and promotion, I think that
> we should consider the cold memory identification and demotion too as a
> full solution.  The existing method based on the page table accessed bit
> may be good enough, but we still need to consider the full solution in
> the context of the general NUMA balancing.
> 

I think that's a great suggestion!  We'll be able to cover the approach 
taken by workingset reporting[*], which is quite powerful for the purposes 
of proactive reclaim through memory.reclaim and would also be very useful 
for identifying cold memory for the purposes of demotion.

 [*] https://lore.kernel.org/linux-mm/20240504073011.4000534-1-yuanchu@xxxxxxxxxx/T/
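
For readers less familiar with the interface, here is a minimal userspace
sketch of proactive reclaim through the cgroup v2 memory.reclaim file.  It
only shows the write itself; the helper name request_reclaim and the idea of
passing the cgroup directory as an argument are assumptions made for
illustration, and deciding how much to reclaim (for example, from workingset
reports) is left out entirely.

```python
import os


def request_reclaim(cgroup_dir: str, nbytes: int) -> bool:
    """Ask the kernel to reclaim up to nbytes from one cgroup.

    cgroup_dir is assumed to be a cgroup v2 directory, e.g. somewhere
    under /sys/fs/cgroup.  Returns True if the write was accepted and
    False if the kernel answered EAGAIN, which it does when it
    reclaimed less than the requested amount.
    """
    if nbytes <= 0:
        raise ValueError("nbytes must be positive")

    path = os.path.join(cgroup_dir, "memory.reclaim")
    fd = os.open(path, os.O_WRONLY)
    try:
        # The file takes a byte count as text, like `echo 1G > memory.reclaim`.
        os.write(fd, str(nbytes).encode("ascii"))
    except BlockingIOError:
        # EAGAIN: the kernel could not reclaim the full amount.
        return False
    finally:
        os.close(fd)
    return True
```

A caller would typically invoke this periodically with a small byte count
and back off when it returns False, since the kernel may over- or
under-reclaim relative to the request.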

> > I'm biased toward a generally useful solution that would leverage the 
> > kernel as the ultimate source of truth for page hotness that can be 
> > extended for multiple use cases, one of which is memory tiering support.  
> > But certainly if there are other approaches, we can discuss that as well.
> >
> > A few main goals from this discussion:
> >
> >  - Ensure that proposals address, or can be extended to address, the 
> >    emerging needs of the various use cases that users may have
> >
> >  - Surface any constraints that stakeholders may find to be prohibitive
> >    for support in the core MM subsystem
> >
> >  - Alignment and division of work for developers who are actively looking
> >    to contribute to this area
> >
> > As I'm just one of many stakeholders for this discussion, I'd nominate 
> > Michal Hocko to moderate it if he's willing to do so.  If he's so willing, 
> > we'd be in good hands :)
> >
> >  [1] https://lore.kernel.org/linux-mm/45d850ec-623b-7c07-c266-e948cdbf1f62@xxxxxxxxx/T/
> 
> --
> Best Regards,
> Huang, Ying
>