Re: RE: RE(4): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL

Gregory Price <gregory.price@xxxxxxxxxxxx> · Fri, 31 Mar 2023 11:53:50 -0400

On Fri, Mar 31, 2023 at 08:34:17PM +0900, Kyungsan Kim wrote:
> Hi Gregory Price. 
> Thank you for joining this topic and sharing your viewpoint.
> I'm sorry for the late reply; our team had some major tasks this week.
> 
> >On Fri, Mar 24, 2023 at 05:48:08PM +0900, Kyungsan Kim wrote:
> >> 
> >> Indeed, we tried the approach. It was able to allocate a kernel context from ZONE_MOVABLE using GFP_MOVABLE.
> >> However, we think it would be bad practice for two reasons.
> >> 1. It occasionally causes an oops and system hang due to kernel page migration during swap or compaction.
> >> 2. The design intention of ZONE_MOVABLE is literally to keep pages movable, so we thought allocating a kernel context from the zone goes against that intention.
> >> 
> >> Allocating a kernel context out of ZONE_EXMEM is unmovable.
> >>   a kernel context -  alloc_pages(GFP_EXMEM,)
> >
> >What is the specific use case of this?  If the answer is flexibility in
> >low-memory situations, why wouldn't the kernel simply change to free up
> >ZONE_NORMAL (swapping user memory, migrating user memory, etc) and
> >allocate as needed?
> >
> >I could see allocating kernel memory from local memory expanders
> >(directly attached to local CXL port), but I can't think of a case where
> >it would be preferable for kernel resources to live on remote memory.
> 
> We have been thinking about kernelspace memory tiering cases.
> The memory tiering we assume places hot data in fast memory and cold data in slow memory.
> We consider zswap, pagecache, and Meta TPP (page promotion/demotion among nodes) to be kernelspace memory tiering cases.
>

So, to clarify, when you say "kernel space memory tiering cases", do you
mean "to support a kernel-space controlled memory tiering service" or do
you mean "tiering of kernel memory"?

Because if it's the former, it seems like a better proposal would be to
extend the NUMA system with additional "cost/feature" attributes, rather
than adding a new zone or modifying the zone of the memory blocks
backing the node.

Note that memory zones can apply to individual blocks within a node, and
not the entire node uniformly.  So when making tiering decisions, it
seems more expedient to investigate a node rather than a block.

> >Since local memory expanders are static devices, there shouldn't be a
> >great need for hotplug, which means the memory could be mapped
> >ZONE_NORMAL without issue.
> >
> 
> IMHO, hot-add/remove is one of the key features of CXL due to its composability aspect.
> Right now, connectivity between CXL devices and systems is limited.
> But the industry is preparing CXL-capable systems that allow more than 10 CXL channels at the backplane, pluggable with EDSFF.
> Not only that, along with the progress of CXL topology - from direct-attached to switch, multi-level switch, and fabric connection -
> I think the hot-add/remove usecase would become more important.
> 
> 

Hot add/remove is already fairly well represented by ZONE_MOVABLE. What
I think is confusing many people is that you are creating a new zone
that's intended to be both hot-pluggable *and* usable by the kernel for
kernel resources/memory, and those two properties are presently
mutually exclusive.

The underlying question is what situation is being hit in which kernel
memory wants to be located in ZONE_MOVABLE/ZONE_EXMEM that cannot simply
be serviced by demoting other, movable memory to these regions.

The concept being that kernel allocations are a higher-priority
allocation than userland, and as such should have priority in DRAM.

For example - there is at least one paper that examined the cost of
placing page tables on CXL Memory Expansion (on the local CXL complex,
not remote) and found the cost is significant.  Page tables are likely
the single largest allocation the kernel will make to service large
memory structures, so the answer to this problem is not necessarily to
place that memory in CXL as well, but to use larger page sizes (which is
less wasteful as memory usage is high and memory is abundant).
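
To put rough numbers on the page-size point, here is a back-of-envelope
sketch. It assumes x86-64 style 8-byte entries and counts only the
last-level entries needed to map a region; upper levels are ignored, and
the function name is made up for illustration.

```python
def leaf_table_bytes(mapped_bytes, page_size, entry_bytes=8):
    """Bytes of last-level page-table entries needed to map a region.

    Assumption: one entry_bytes-sized entry per mapped page (x86-64
    style); upper-level tables are not counted.
    """
    entries = -(-mapped_bytes // page_size)  # ceiling division
    return entries * entry_bytes


TIB = 2**40
small = leaf_table_bytes(TIB, 4 * 2**10)   # 4 KiB pages
huge = leaf_table_bytes(TIB, 2 * 2**20)    # 2 MiB pages

print(small // 2**20, "MiB of entries with 4 KiB pages")  # 2048 MiB
print(huge // 2**20, "MiB of entries with 2 MiB pages")   # 4 MiB
```

Under those assumptions, mapping 1 TiB drops from about 2 GiB of
last-level entries to about 4 MiB, a factor of 512.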

I just don't understand what kernel resources would meet the following
attributes:

1) Do not have major system performance impacts in high-latency memory
2) Are sufficiently large to warrant tiering
and
3) Are capable of being moved (i.e. no pinned areas, no dma areas, etc)

> >> Allocating a user context out of ZONE_EXMEM is movable.
> >>   a user context - mmap(,,MAP_EXMEM,) - syscall - alloc_pages(GFP_EXMEM | GFP_MOVABLE,)
> >> This is how ZONE_EXMEM supports the two cases.
> >> 

So if MAP_EXMEM is not used, EXMEM would not be used?

That seems counterintuitive.  If an allocation via mmap would be
eligible for ZONE_MOVABLE, why wouldn't it be eligible for ZONE_EXMEM?

I believe this is another reason why some folks are confused about what
the distinction between MOVABLE and EXMEM is.  They seem to ultimately
reduce to whether the memory can be moved.

> >
> >Is it intended for a user to explicitly request MAP_EXMEM for it to get
> >used at all?  As in, if i simply mmap() without MAP_EXMEM, will it
> >remain unutilized?
> 
> Our intention is to allow below 3 cases
> 1. Explicit DDR allocation - mmap(,,MAP_NORMAL,)
>  : allocation from ZONE_NORMAL or ZONE_MOVABLE, or allocation fails.
> 2. Explicit CXL allocation - mmap(,,MAP_EXMEM,)
>  : allocation from ZONE_EXMEM, or allocation fails.
> 3. Implicit Memory allocation - mmap(,,,) 
>  : allocation from ZONE_NORMAL, ZONE_MOVABLE, or ZONE_EXMEM. In other words, it does not matter whether the memory is DDR or CXL DRAM.
> 
> Of these, case 3 is similar to vanilla kernel operation in that the allocation request traverses multiple zones or nodes.
> We think it could be good or bad from the mmap caller's point of view.
> It is good because memory is allocated, but it could be bad because the caller has no idea of the allocated memory type.
> The latter would hurt QoS metrics or userspace memory tiering operations, which expect near/far memory.
> 
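
As I read them, the three cases quoted above reduce to a per-flag zone
fallback list. A toy model of that reading follows; it is not kernel
code. MAP_NORMAL and MAP_EXMEM are the proposed flags from the quoted
text, while the fallback order in the implicit case and the allocate()
helper are my assumptions.

```python
# Toy model: which zones each proposed mmap flag may allocate from.
# The ordering within each list is an assumption, not from the proposal.
ZONE_FALLBACK = {
    "MAP_NORMAL": ["ZONE_NORMAL", "ZONE_MOVABLE"],            # case 1
    "MAP_EXMEM": ["ZONE_EXMEM"],                              # case 2
    None: ["ZONE_NORMAL", "ZONE_MOVABLE", "ZONE_EXMEM"],      # case 3
}


def allocate(free_pages, flag=None, pages=1):
    """Return the zone that satisfied the request, or None on failure.

    free_pages maps zone name -> free page count and is updated in place.
    """
    for zone in ZONE_FALLBACK[flag]:
        if free_pages.get(zone, 0) >= pages:
            free_pages[zone] -= pages
            return zone
    return None


free = {"ZONE_NORMAL": 0, "ZONE_MOVABLE": 0, "ZONE_EXMEM": 2}
print(allocate(free, "MAP_NORMAL"))  # None: DDR exhausted, no fallback
print(allocate(free, "MAP_EXMEM"))   # ZONE_EXMEM
print(allocate(free))                # ZONE_EXMEM: implicit case falls through
```

In this model the implicit case only reaches ZONE_EXMEM after the DDR
zones are exhausted, and the caller cannot tell which zone it got.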

For what it's worth, mmap is not the correct API for userland to give
the kernel hints on data placement.  That would be madvise and friends.
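
A minimal userland sketch of that split: mmap creates the mapping,
madvise carries the hint. No existing advice value selects CXL memory,
so MADV_SEQUENTIAL stands in purely as an example of a hint, and
map_then_advise() is a made-up name. It assumes Python 3.8+ on a
platform that exposes madvise.

```python
import mmap


def map_then_advise(length=mmap.PAGESIZE * 4):
    """Create an anonymous mapping, then pass a hint via madvise.

    Returns (advised, data): whether a hint was applied, and five bytes
    read back from the mapping.
    """
    buf = mmap.mmap(-1, length)  # anonymous mapping; no placement flag here

    advised = False
    # madvise and the MADV_* constants are platform-dependent, so guard.
    if hasattr(buf, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
        buf.madvise(mmap.MADV_SEQUENTIAL)  # stand-in hint, not CXL-specific
        advised = True

    buf[:5] = b"hello"
    data = bytes(buf[:5])
    buf.close()
    return advised, data


advised, data = map_then_advise()
print(advised, data)
```

The hint is advisory either way: the kernel remains free to ignore it
or to move the pages later.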

But further, allocation of memory from userland must be ok with having
its memory moved/swapped/whatever unless additional assistance from the
kernel is provided (page pinning, mlock, whatever) to ensure it will
not be moved.  Presumably this is done to ensure the kernel can make
runtime adjustments to protect itself from being denied memory and
causing instability and/or full system faults.

I think you need to clarify your intents for this zone, in particular
your intent for exactly what data can and cannot live in this zone and
the reasons for this.  "To assist kernel tiering operations" is very
vague and not a description of what memory is and is not allowed in the
zone.

~Gregory