RE: RE: RE(4): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL

Kyungsan Kim <ks0204.kim@xxxxxxxxxxx> · Fri, 31 Mar 2023 20:34:17 +0900

Hi Gregory Price,
Thank you for joining this topic and sharing your viewpoint.
I'm sorry for the late reply; our team had some major tasks this week.

>On Fri, Mar 24, 2023 at 05:48:08PM +0900, Kyungsan Kim wrote:
>> 
>> Indeed, we tried the approach. It was able to allocate a kernel context from ZONE_MOVABLE using GFP_MOVABLE.
>> However, we think it would be a bad practice for the 2 reasons.
>> 1. It causes oops and system hang occasionally due to kernel page migration while swap or compaction. 
>> 2. Literally, the design intention of ZONE_MOVABLE is to a page movable. So, we thought allocating a kernel context from the zone hurts the intention.
>> 
>> Allocating a kernel context out of ZONE_EXMEM is unmovable.
>>   a kernel context -  alloc_pages(GFP_EXMEM,)
>
>What is the specific use case of this?  If the answer is flexibility in
>low-memory situations, why wouldn't the kernel simply change to free up
>ZONE_NORMAL (swapping user memory, migrating user memory, etc) and
>allocate as needed?
>
>I could see allocating kernel memory from local memory expanders
>(directly attached to local CXL port), but I can't think of a case where
>it would be preferable for kernel resources to live on remote memory.

We have kernelspace memory tiering cases in mind.
The memory tiering we assume places hot data in fast memory and cold data in slow memory.
We consider zswap, pagecache, and Meta TPP (page promotion/demotion among nodes) to be kernelspace memory tiering cases.

>Since local memory expanders are static devices, there shouldn't be a
>great need for hotplug, which means the memory could be mapped
>ZONE_NORMAL without issue.
>

IMHO, hot-add/remove is one of the key features of CXL because of its composability aspect.
Right now, connectivity between CXL devices and systems is limited.
But the industry is preparing CXL-capable systems that allow more than 10 CXL channels at the backplane, pluggable with EDSFF.
In addition, as CXL topology progresses from direct-attached to switch, multi-level switch, and fabric connection,
I think the hot-add/remove use case will become more important.

>> Allocating a user context out of ZONE_EXMEM is movable.
>>   a user context - mmap(,,MAP_EXMEM,) - syscall - alloc_pages(GFP_EXMEM | GFP_MOVABLE,)
>> This is how ZONE_EXMEM supports the two cases.
>> 
>
>Is it intended for a user to explicitly request MAP_EXMEM for it to get
>used at all?  As in, if i simply mmap() without MAP_EXMEM, will it
>remain unutilized?

Our intention is to allow the 3 cases below.
1. Explicit DDR allocation - mmap(,,MAP_NORMAL,)
 : allocation from ZONE_NORMAL or ZONE_MOVABLE, or allocation fails.
2. Explicit CXL allocation - mmap(,,MAP_EXMEM,)
 : allocation from ZONE_EXMEM, or allocation fails.
3. Implicit memory allocation - mmap(,,,)
 : allocation from ZONE_NORMAL, ZONE_MOVABLE, or ZONE_EXMEM. In other words, regardless of whether the memory is DDR or CXL DRAM.

Of these, case 3 is similar to vanilla kernel operation in that the allocation request traverses multiple zones or nodes.
We think it can be good or bad from the mmap caller's point of view.
It is good because memory gets allocated, but it could be bad because the caller has no idea which type of memory was allocated.
The latter would hurt QoS metrics or userspace memory tiering operations, which expect a near/far memory distinction.
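
To make the 3 cases concrete, here is a minimal userspace sketch in C of the zone eligibility we intend. It is not kernel code and not the actual SMDK implementation. The names map_req, zone_allowed(), pick_zone() and the free_pages[] counters are hypothetical and exist only for illustration, and the traversal order (ZONE_NORMAL, then ZONE_MOVABLE, then ZONE_EXMEM) is an assumption, not the real zonelist order.

```c
/* Zones named in this thread. ZONE_EXMEM is the proposed zone. */
enum zone_id { ZONE_NORMAL, ZONE_MOVABLE, ZONE_EXMEM, NR_ZONES };

/* Hypothetical request types standing in for mmap(,,MAP_NORMAL,),
 * mmap(,,MAP_EXMEM,) and a plain mmap(,,,). */
enum map_req { REQ_NORMAL, REQ_EXMEM, REQ_IMPLICIT };

/* Return 1 if a request of type req may be served from zone z. */
int zone_allowed(enum map_req req, enum zone_id z)
{
    switch (req) {
    case REQ_NORMAL:   /* case 1: DDR only */
        return z == ZONE_NORMAL || z == ZONE_MOVABLE;
    case REQ_EXMEM:    /* case 2: CXL only */
        return z == ZONE_EXMEM;
    case REQ_IMPLICIT: /* case 3: any zone */
        return 1;
    }
    return 0;
}

/* Pick the first allowed zone that still has a free page.
 * free_pages[] is a toy per-zone counter indexed by zone_id.
 * Returns the chosen zone, or -1 when the allocation fails.
 * The scan order here is an assumption made for the sketch. */
int pick_zone(enum map_req req, const unsigned long free_pages[NR_ZONES])
{
    for (int z = 0; z < NR_ZONES; z++) {
        if (zone_allowed(req, (enum zone_id)z) && free_pages[z] > 0)
            return z;
    }
    return -1;
}
```

With DDR exhausted, pick_zone(REQ_NORMAL, ...) fails while pick_zone(REQ_IMPLICIT, ...) silently lands in ZONE_EXMEM. That is the situation where the caller of case 3 cannot tell which memory type it received.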

>
>~Gregory