Re: [LSF/MM/BPF TOPIC] Locally attached memory tiering



On Mon, 06 May 2024, David Rientjes wrote:

Hi all,

I think it would be very worthwhile to have a block set aside for
discussion on locally attached memory tiering extensions at LSF/MM/BPF


fyi Adam's proposal which touches on both cxl and tiering:

Primarily interested in discussing Linux enlightenment for CXL 1.1 and
later type-3 memory expansion devices (CXL.mem).  I think we could touch
on CXL 2.0 and later memory pooling architectures if we have time and
there is interest, but the primary focus here would be local attached.

Based on the premise for a Memory Tiering Working Group[1], there is
widespread interest in the foundational topics for generally useful Linux

- Decoupling CPU balancing from memory balancing (or obsoleting CPU
  balancing entirely)

  + John Hubbard notes this would be useful for GPUs:

     a) GPUs have their own processors that are invisible to the kernel's
        NUMA "which tasks are active on which NUMA nodes" calculations,

     b) Similar to where CXL is generally going, we have already built
        fully memory-coherent hardware, which include memory-only NUMA

+Cc peterz

- In-kernel hot memory abstraction, informed by hardware hinting drivers
  (incl some architectures like Power10), usable as a NUMA Balancing
  backend for promotion and other areas of the kernel like transparent
  hugepage utilization

- NUMA and memory tiering enlightenment for accelerators, such as for
  optimal use of GPU memory, extremely important for a cloud provider
  (hint hint :)

- Asynchronous memory promotion independent of task_numa_fault() while
  considering the cost of page migration (due to identifying cold memory)

This would be nice for users who like to disable NUMA balancing. But overall
when compared to anything hardware can give us (ala ppc, without the required
kernel overhead of x86-based counters), I fear that software solutions will
always be found wanting. And, afaik, numa balancing based promotion is still
one of the top pain points in memory tiering.

So, of course, improving the software approach is still a good thing. Fyi
along these lines, improving/optimizing the current numa balancing approach
has proven irrelevant in the larger scale of benchmarks, afaik. One example
is (active) LRU based promotion instead of blindly promoting the faulting
page, which could be rarely used. Benchmarks show a significant reduction in
the promote/demote traffic caused by ping pong cases, but unfortunately
little to no tangible performance wins in actual benchmark numbers.
Similarly, the proposed migrc[1] shows great TLB flushing benefits but
minimal benchmark (XSBench) improvement.
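
To make the promotion-traffic point concrete, here is a toy userspace model
(not kernel code) comparing the two policies. Everything in it is an
assumption for illustration: the function name count_promotions, the policy
names, and the use of "touched at least active_threshold times" as a stand-in
for a page sitting on the active LRU. It ignores demotion entirely and says
nothing about end-to-end benchmark numbers.

```python
def count_promotions(trace, policy, active_threshold=2):
    """Count slow-tier -> fast-tier promotions for an access trace.

    trace: iterable of page ids, each entry one access (hinting fault)
           to a page that starts out on the slow tier.
    policy: "fault"  promotes a page on its first fault;
            "active" promotes only after active_threshold touches, a
                     rough stand-in for the page being on the active LRU.
    """
    if policy not in ("fault", "active"):
        raise ValueError("unknown policy: %r" % (policy,))

    touches = {}        # page id -> accesses seen while on the slow tier
    promoted = set()    # pages already moved to the fast tier
    promotions = 0

    for page in trace:
        if page in promoted:
            continue    # already in fast memory, no further hinting fault
        touches[page] = touches.get(page, 0) + 1
        if policy == "fault" or touches[page] >= active_threshold:
            promoted.add(page)
            promotions += 1
    return promotions


# Two hot pages touched repeatedly, plus ten pages touched exactly once.
trace = [0, 1] * 5 + list(range(100, 110))
fault_promotions = count_promotions(trace, "fault")
active_promotions = count_promotions(trace, "active")
```

With this trace the fault policy promotes every page it sees, including the
ten touched only once, while the active policy promotes only the two hot
pages. That is the traffic reduction mentioned above; whether it turns into
a measurable win is exactly what the benchmarks fail to show.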

... which brings me to the topic of benchmarking. What are the workloads
people care about, beyond pmbench? I tend to use OLTP based database workloads
with wss/buffers larger than the total capacity of the fast memory nodes.

- What role userspace plays in this decision-making and how we can
  extend the default policy and mechanisms in the kernel to allow for it
  if necessary

Additional topics that you find interesting are also very helpful!

I'm biased toward a generally useful solution that would leverage the
kernel as the ultimate source of truth for page hotness that can be
extended for multiple use cases, one of which is memory tiering support.
But certainly if there are other approaches, we can discuss that as well.

A few main goals from this discussion:

- Ensure that proposals address, or can be extended to address, the
  emerging needs of the various use cases that users may have

- Surface any constraints that stakeholders may find to be prohibitive
  for support in the core MM subsystem

- Alignment and division of work for developers who are actively looking
  to contribute to this area

As I'm just one of many stakeholders for this discussion, I'd nominate
Michal Hocko to moderate it if he's willing to do so.  If he's so willing,
we'd be in good hands :)



