Re: [LSF/MM/BPF TOPIC] Restricting or migrating unmovable kernel allocations from slow tier

Dan Williams <dan.j.williams@xxxxxxxxx> · Mon, 3 Feb 2025 14:09:26 -0800

Gregory Price wrote:
> On Sun, Feb 02, 2025 at 12:13:23AM +0900, Hyeonggon Yoo wrote:
> > On Sat, Feb 1, 2025 at 11:04 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > > This all seems like a grand waste of time.  Don't do that.  Don't allow
> > > kernel allocations from CXL at all. Don't build systems that have
> > > vast quantities of CXL memory (or if you do, expose it as really fast
> > > swap, not as memory).
> > >
> > 
> > Hi, Matthew. Thank you for sharing your opinion.
> > 
> > I don't want to introduce too much complexity to MM due to CXL madness either,
> > but I think at least we need to guide users who buy CXL hardware to avoid
> > doing stupid things.
> > 
> > My initial subject was "Clearly documenting the use cases of
> > memhp_default_state=online{,_kernel}" because at first glance,
> > it was deemed usable for allowing kernel allocations from CXL,
> > which turned out to be not after some evaluation.
> >
> 
> This was the motivation for implementing the build-time switch for
> memhp_default_state.  Distros and builders can now have flexibility
> to make this their default policy for hotplug memory blocks.
> 
> https://lore.kernel.org/linux-mm/20241226182918.648799-1-gourry@xxxxxxxxxx/
> 
> I don't normally agree with Willy's hard takes on CXL, but I do agree
> that it's generally not fit for kernel use - and I share general skepticism
> that movement-based tiering is fundamentally better than reclaim/swap
> semantics (though I have been convinced otherwise in some scenarios,
> and I think some clear performance benefits in many scenarios are lost
> by treating it as super-fast-swap).

It is also the case that CXL topologies enumerate their performance
characteristics; "CXL" is not a latency characteristic unto itself.

For example, like "PCI", "CXL" by itself does not imply a performance
profile. You could have CPU attached DDR that presents as a "CXL"
enumerated device just to take advantage of now standardized RAS
interfaces.

Unless and until this whole heterogeneous memory experiment fails, all
the kernel can do is give userspace the ability to include/exclude
memory ranges that are marked as outside the default pool. That is what
EFI_MEMORY_SP is all about, to set aside: too precious for the default
pool => HBM, or too slow for the default pool => potentially CXL and
PMEM.

A kernel default policy, or better yet distribution policy, that more
aggressively excludes CXL memory based on its relative performance to
the default pool would be a welcome improvement.
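
As a rough illustration of what such a policy could look like, here is a
minimal userspace-style sketch in Python. Everything in it is an
assumption made for the example: the MemoryRange type, the
choose_online_policy() helper, and both ratio thresholds are hypothetical
and do not correspond to any existing kernel or distribution interface.
Only the returned strings mirror the memory-block states that
memhp_default_state accepts (online_kernel, online_movable, offline).

```python
from dataclasses import dataclass

# Hypothetical cut-offs, chosen only for illustration.
KERNEL_OK_RATIO = 1.25   # close enough to the default pool for kernel use
MOVABLE_OK_RATIO = 3.0   # beyond this, keep the range out of the page allocator


@dataclass(frozen=True)
class MemoryRange:
    """A hot-added range and the read latency the platform enumerated for it."""
    name: str
    read_latency_ns: int


def choose_online_policy(default_latency_ns: int, rng: MemoryRange) -> str:
    """Pick a memory-block state from latency relative to the default pool."""
    if default_latency_ns <= 0:
        raise ValueError("default pool latency must be positive")
    ratio = rng.read_latency_ns / default_latency_ns
    if ratio <= KERNEL_OK_RATIO:
        # Performs like the default pool: unmovable allocations are acceptable.
        return "online_kernel"
    if ratio <= MOVABLE_OK_RATIO:
        # Slower tier: only movable (user) pages, so the kernel never pins it.
        return "online_movable"
    # Too slow for the default pool: leave it for userspace to claim explicitly.
    return "offline"


if __name__ == "__main__":
    default_ns = 100  # made-up latency of the default pool
    for rng in (
        MemoryRange("cxl-direct-attached", 110),
        MemoryRange("cxl-behind-switch", 250),
        MemoryRange("cxl-far", 600),
    ):
        print(rng.name, choose_online_policy(default_ns, rng))
```

The point of the sketch is only that the decision keys off enumerated
performance relative to the default pool, not off the "CXL" label.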

> Rather than ask whether we can make portions of the kernel more amenable
> to movable allocations, I think it's more beneficial to focus on whether
> we can reduce the ZONE_NORMAL cost of ZONE_MOVABLE capacity. That seems
> (to me) like the actual crux of this particular issue.

Yes, I like this line of thinking. Even if CXL attached memory struggles
to graduate out of cold-memory tier use cases, that struggle can yield
other general improvements that are welcome independent of CXL.