Re: [LSF/MM/BPF TOPIC] The future of memory tiering

On 4/26/23 21:30, David Rientjes wrote:
> Hi everybody,
> 
> As requested, sending along a last minute topic suggestion for 
> consideration for LSF/MM/BPF 2023 :)
> 
> For a sizable set of emerging technologies, memory tiering presents one of 
> the most formidable challenges and exciting opportunities for the MM
> subsystem today.
> 
> "Memory tiering" can mean many different things based on the user: from 
> traditional everyday NUMA, to swap (to zswap), to NVDIMMs, to HBM, to
> locally attached CXL memory, to memory borrowing over PCIe, to memory 
> pooling with disaggregation, and beyond.
> 
> Just as NUMA started out only being useful for supercomputers, memory
> tiering will likely evolve over the next five years to take on an 
> expanding set of use cases, and likely with rapidly increasing adoption 
> even beyond hyperscalers.
> 
> I think a discussion about memory tiering would be highly valuable.  A few 
> key questions that I think can drive this discussion:
> 
>  - What are the various form factors that must be supported as short-term 
>    goals as well as need to be supported 5+ years into the future?
> 
>  - What incremental changes need to be made on top of NUMA support to
>    fully support the wide range of use cases that will be coming?  (Is
>    memory tiering support built entirely upon NUMA?)
> 
>  - What is the minimum viable *default* support that the MM subsystem 
>    should provide for tiered configs?  What are the set of optimizations
>    that should be left to userspace or BPF to control?
> 
>  - What are the various page promotion techniques that we must plan for
>    beyond traditional NUMA balancing that will allow us to exploit
>    hardware innovation?
> 
> (And I'm sure there are more topics of discussion that others would 
> readily add.  It would be great to have additional ideas in replies.)
> 
> A key challenge in all of this is to make memory tiering support in the 
> upstream kernel compatible with the roadmaps of various CPU vendors.  A 
> key goal is to ensure the end user benefits from all of this rapid 
> innovation with generalized support that is well abstracted and allows for 
> extensibility.
> 

Yes, this is an extremely relevant topic from our point of view, as Jason
already mentioned. I'm very interested in a system that works well in
the presence of highly capable devices that can handle replayable page
faults and can co-process along with the CPU. Eventually, the kernel
should, arguably, be more aware of what a GPU or smart NIC is doing with
both memory and (device) processor time.

I have lots of examples of that, and one of my favorites is the current
autonuma behavior: unmapping a lot of pages, and waiting for *CPU* page
faults, in order to decide which NUMA node those pages are best placed
on. Of course, if a GPU or other page-fault-capable device has those
pages mapped, the MMU notifier callbacks force the device to unmap
those pages and then fault them back in, at a truly huge, and
unnecessary, performance cost.
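
To make that cost concrete, here is a minimal sketch of the
mmu_interval_notifier path such a driver typically implements. The
driver names (my_gpu_mirror, my_gpu_unmap_range, etc.) are hypothetical,
not taken from any real driver. The point is that when NUMA balancing
makes pages inaccessible just to sample CPU faults, the invalidate
callback fires and the driver has no choice but to tear down its device
mappings for the whole range, which the device then has to fault back in:

#include <linux/kernel.h>
#include <linux/mmu_notifier.h>
#include <linux/sched/mm.h>

/* Hypothetical per-process mirror state for a fault-capable device. */
struct my_gpu_mirror {
	struct mmu_interval_notifier notifier;
	/* ... device page-table handle, locks, etc. ... */
};

/* Stand-in for the driver's real device PTE teardown + device TLB flush. */
static void my_gpu_unmap_range(struct my_gpu_mirror *mirror,
			       unsigned long start, unsigned long end)
{
}

/*
 * Called by the core MM whenever the CPU page tables covering
 * [range->start, range->end) change -- including when NUMA balancing
 * unmaps pages only to see which CPU touches them next.
 */
static bool my_gpu_invalidate(struct mmu_interval_notifier *mni,
			      const struct mmu_notifier_range *range,
			      unsigned long cur_seq)
{
	struct my_gpu_mirror *mirror =
		container_of(mni, struct my_gpu_mirror, notifier);

	mmu_interval_set_seq(mni, cur_seq);
	my_gpu_unmap_range(mirror, range->start, range->end);
	return true;
}

static const struct mmu_interval_notifier_ops my_gpu_mni_ops = {
	.invalidate = my_gpu_invalidate,
};

/* Mirror a VA range of the current process on the device. */
static int my_gpu_mirror_register(struct my_gpu_mirror *mirror,
				  unsigned long start, unsigned long length)
{
	return mmu_interval_notifier_insert(&mirror->notifier, current->mm,
					    start, length, &my_gpu_mni_ops);
}

Real drivers additionally serialize this against their device fault
handlers (the cur_seq retry protocol), but even in this simplified form
you can see that every autonuma sampling round funnels through the
invalidate path and forces a device unmap/refault cycle.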

Also, when thinking about designs, sometimes it helps to think about
memory from the perspective of these devices, just to kind of shake
up the mental model.

thanks,
-- 
John Hubbard
NVIDIA




