RE: Re: RE: RE(4): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL

>On Fri, Mar 31, 2023 at 08:34:17PM +0900, Kyungsan Kim wrote:
>> Hi Gregory Price. 
>> Thank you for joining this topic and sharing your viewpoint.
>> I'm sorry for the late reply; our team had some major tasks this week.
>> 
>> >On Fri, Mar 24, 2023 at 05:48:08PM +0900, Kyungsan Kim wrote:
>> >> 
>> >> Indeed, we tried that approach. We were able to allocate a kernel context from ZONE_MOVABLE using GFP_MOVABLE.
>> >> However, we think it would be bad practice for two reasons.
>> >> 1. It occasionally causes oopses and system hangs due to kernel page migration during swap or compaction.
>> >> 2. Literally, the design intention of ZONE_MOVABLE is to keep pages movable. So, we thought allocating a kernel context from the zone goes against that intention.
>> >> 
>> >> A kernel context allocated out of ZONE_EXMEM is unmovable.
>> >>   a kernel context -  alloc_pages(GFP_EXMEM,)
>> >
>> >What is the specific use case of this?  If the answer is flexibility in
>> >low-memory situations, why wouldn't the kernel simply change to free up
>> >ZONE_NORMAL (swapping user memory, migrating user memory, etc) and
>> >allocate as needed?
>> >
>> >I could see allocating kernel memory from local memory expanders
>> >(directly attached to local CXL port), but I can't think of a case where
>> >it would be preferable for kernel resources to live on remote memory.
>> 
>> We have thought about kernel-space memory tiering cases.
>> The memory tiering we assume places hot data in fast memory and cold data in slow memory.
>> We consider zswap, pagecache, and Meta TPP (page promotion/demotion among nodes) to be kernel-space memory tiering cases.
>>
>
>So, to clarify, when you say "kernel space memory tiering cases", do you
>mean "to support a kernel-space controlled memory tiering service" or do
>you mean "tiering of kernel memory"?

Actually, both. 
Borrowing your expression :), we mean a "kernel-space controlled memory tiering service that tiers kernel memory".
For example, during zswap operation (a kernel-space memory tiering case) on a vanilla kernel,
when a user page from CXL DRAM is swapped out, the zbud allocator of zswap can allocate a zswap page from DDR DRAM (tiering of kernel memory).
We think this is odd, because the swapped-out page is effectively promoted from CXL DRAM (far memory) to DDR DRAM (near memory).

>Because if it's the former, rather than a new zone, it seems like a
>better proposal would be to extend the numa system to add additional
>"cost/feature" attributes, rather than modifying the zone of the memory
>blocks backing the node.
>
>Note that memory zones can apply to individual blocks within a node, and
>not the entire node uniformly.  So when making tiering decisions, it
>seems more expedient to investigate a node rather than a block.
>
>
>> >Since local memory expanders are static devices, there shouldn't be a
>> >great need for hotplug, which means the memory could be mapped
>> >ZONE_NORMAL without issue.
>> >
>> 
>> IMHO, hot-add/remove is one of the key features of CXL due to its composability.
>> Right now, CXL device and system connectivity is limited.
>> But the industry is preparing CXL-capable systems that allow more than 10 CXL channels on the backplane, pluggable with EDSFF.
>> Not only that, along with the progress of CXL topology - from direct-attached to switch, multi-level switch, and fabric connection -
>> I think the hot-add/remove use case will become more important.
>> 
>> 
>
>Hot add/remove is fairly well represented by ZONE_MOVABLE. What I think
>is confusing many people is that creating a new zone that's intended
>to be hot-pluggable *and* usable by the kernel for kernel resources/memory
>combines two presently exclusive operations.
>
>The underlying question is what situation is being hit in which kernel
>memory wants to be located in ZONE_MOVABLE/ZONE_EXMEM that cannot simply
>be serviced by demoting other, movable memory to these regions.
>
>The concept being that kernel allocations are a higher-priority
>allocation than userland, and as such should have priority in DRAM.
>
>For example - there is at least one paper that examined the cost of
>placing page tables on CXL Memory Expansion (on the local CXL complex,
>not remote) and found the cost is significant.  Page tables are likely
>the single largest allocation the kernel will make to service large
>memory structures, so the answer to this problem is not necessarily to
>place that memory in CXL as well, but to use larger page sizes (which is
>less wasteful as memory usage is high and memory is abundant).
>
>I just don't understand what kernel resources would meet the following
>attributes:
>
>1) Do not have major system performance impacts in high-latency memory
>2) Are sufficiently large to warrant tiering
>and
>3) Are capable of being moved (i.e. no pinned areas, no dma areas, etc)
>

I agree that every level of the page table should be in near memory.
In general, data that needs to be handled quickly, such as indexing data, prefers near memory.
Far memory suits data that is less user-interactive and less latency-sensitive.
Basically, our approach takes the memory provider's stance, not the memory consumer's.

>> >> A user context allocated out of ZONE_EXMEM is movable.
>> >>   a user context - mmap(,,MAP_EXMEM,) - syscall - alloc_pages(GFP_EXMEM | GFP_MOVABLE,)
>> >> This is how ZONE_EXMEM supports the two cases.
>> >> 
>
>So if MAP_EXMEM is not used, EXMEM would not be used?
>
>That seems counter intuitive.  If an allocation via mmap would be
>eligible for ZONE_MOVABLE, why wouldn't it be eligible for ZONE_EXMEM?
>
>I believe this is another reason why some folks are confused about what
>the distinction between MOVABLE and EXMEM is.  They seem to ultimately
>reduce to whether the memory can be moved.

Not really. We intend that EXMEM can be used both implicitly and explicitly.
Please refer to the answer below for details.

>
>> >
>> >Is it intended for a user to explicitly request MAP_EXMEM for it to get
>> >used at all?  As in, if i simply mmap() without MAP_EXMEM, will it
>> >remain unutilized?
>> 
>> Our intention is to allow the 3 cases below:
>> 1. Explicit DDR allocation - mmap(,,MAP_NORMAL,)
>>  : allocation from ZONE_NORMAL or ZONE_MOVABLE, or allocation fails.
>> 2. Explicit CXL allocation - mmap(,,MAP_EXMEM,)
>>  : allocation from ZONE_EXMEM, or allocation fails.
>> 3. Implicit memory allocation - mmap(,,,)
>>  : allocation from ZONE_NORMAL, ZONE_MOVABLE, or ZONE_EXMEM. In other words, from either DDR or CXL DRAM.
>> 
>> Among these, case 3 is similar to vanilla kernel operation in that the allocation request traverses multiple zones or nodes.
>> We think it could be good or bad from the mmap caller's point of view.
>> It is good because memory is allocated, but it could be bad because the caller has no idea of the allocated memory type.
>> The latter would hurt QoS metrics or userspace memory tiering operations, which expect near/far memory.
>> 
>
>For what it's worth, mmap is not the correct api for userland to provide
>kernel hints on data placement.  That would be madvise and friends.

Yes, our key intention is to give userland a way to provide such hints.
Not only through mmap(), but also through mbind(), set_mempolicy(), madvise(), etc.

>
>But further, allocation of memory from userland must be ok with having
>its memory moved/swapped/whatever unless additional assistance from the
>kernel is provided (page pinning, mlock, whatever) to ensure it will
>not be moved.  Presumably this is done to ensure the kernel can make
>runtime adjustments to protect itself from being denied memory and
>causing instability and/or full system faults.

Yes. In the case of implicit allocation, our proposal is fully compatible with vanilla Linux MM.
Our thought is to provide both explicit and implicit ways.
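
To make the three mmap cases quoted above concrete, here is a minimal userspace sketch of the zone selection they describe. It is only a toy model under stated assumptions, not the SMDK implementation: the names HINT_NORMAL and HINT_EXMEM stand in for the proposed MAP_NORMAL and MAP_EXMEM flags (which do not exist in mainline Linux), pick_zone() and struct zones are hypothetical, and the fallback order used for the implicit case (NORMAL, then MOVABLE, then EXMEM) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical placement hints modelled on the proposal's MAP_NORMAL and
 * MAP_EXMEM; neither flag exists in mainline Linux. */
enum map_hint { HINT_NONE, HINT_NORMAL, HINT_EXMEM };

/* Toy zone identifiers; ZONE_EXMEM is the proposed zone, not a mainline one. */
enum zone { ZONE_NONE = -1, ZONE_NORMAL, ZONE_MOVABLE, ZONE_EXMEM, NR_ZONES };

/* Free pages per zone in this toy model. */
struct zones {
    long free_pages[NR_ZONES];
};

/* Serve a one-page request and return the zone used, or ZONE_NONE if the
 * zones permitted by the hint are all exhausted. */
enum zone pick_zone(struct zones *z, enum map_hint hint)
{
    static const enum zone ddr_only[] = { ZONE_NORMAL, ZONE_MOVABLE };
    static const enum zone cxl_only[] = { ZONE_EXMEM };
    /* Assumed fallback order for the implicit case. */
    static const enum zone any_zone[] = { ZONE_NORMAL, ZONE_MOVABLE, ZONE_EXMEM };
    const enum zone *order = any_zone;
    size_t n = sizeof(any_zone) / sizeof(any_zone[0]);

    if (hint == HINT_NORMAL) {          /* case 1: explicit DDR */
        order = ddr_only;
        n = sizeof(ddr_only) / sizeof(ddr_only[0]);
    } else if (hint == HINT_EXMEM) {    /* case 2: explicit CXL */
        order = cxl_only;
        n = sizeof(cxl_only) / sizeof(cxl_only[0]);
    }                                   /* case 3: implicit, any zone */

    for (size_t i = 0; i < n; i++) {
        if (z->free_pages[order[i]] > 0) {
            z->free_pages[order[i]]--;
            return order[i];
        }
    }
    return ZONE_NONE;
}

/* Usage: DDR is exhausted and CXL still has two free pages. */
void demo_mmap_cases(void)
{
    struct zones z = { .free_pages = { 0, 0, 2 } };

    assert(pick_zone(&z, HINT_NORMAL) == ZONE_NONE);  /* explicit DDR fails */
    assert(pick_zone(&z, HINT_EXMEM) == ZONE_EXMEM);  /* explicit CXL succeeds */
    assert(pick_zone(&z, HINT_NONE) == ZONE_EXMEM);   /* implicit falls back */
}
```

The point of the sketch is that only the implicit case may cross from DDR to CXL DRAM; the explicit cases fail rather than fall back, so the caller always knows which memory type it received.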

>
>
>I think you need to clarify your intents for this zone, in particular
>your intent for exactly what data can and cannot live in this zone and
>the reasons for this.  "To assist kernel tiering operations" is very
>vague and not a description of what memory is and is not allowed in the
>zone.

We don't restrict what data can live in ZONE_EXMEM.
Our intention is to allow both movable and unmovable allocations from kernel and user contexts.
Also, the allocation context determines the movability.
In other words, ZONE_EXMEM is not intended to confine use cases, but to provide ways to run a use case on CXL DRAM.
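
As a rough illustration of the allocation context deciding movability, the sketch below models the two call paths mentioned earlier in the thread: alloc_pages(GFP_EXMEM,) for a kernel context and alloc_pages(GFP_EXMEM | GFP_MOVABLE,) for a user context. Everything here is a userspace toy under assumptions: the MODEL_GFP_* bit values are arbitrary, struct model_page and the helper functions are hypothetical, and none of this is real kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for illustration only. The values are arbitrary,
 * and GFP_EXMEM is the proposed flag from this thread, not a mainline one. */
#define MODEL_GFP_MOVABLE 0x1u
#define MODEL_GFP_EXMEM   0x2u

/* A toy page that only remembers the flags it was allocated with. */
struct model_page {
    unsigned int gfp;
};

/* Kernel context: alloc_pages(GFP_EXMEM,) in the thread's notation. */
struct model_page alloc_kernel_context(void)
{
    return (struct model_page){ .gfp = MODEL_GFP_EXMEM };
}

/* User context: alloc_pages(GFP_EXMEM | GFP_MOVABLE,) in the thread's notation. */
struct model_page alloc_user_context(void)
{
    return (struct model_page){ .gfp = MODEL_GFP_EXMEM | MODEL_GFP_MOVABLE };
}

/* Both kinds of page live in the same zone. */
bool page_in_exmem(const struct model_page *p)
{
    return (p->gfp & MODEL_GFP_EXMEM) != 0;
}

/* Movability comes from the allocation flags, not from the zone. */
bool page_is_movable(const struct model_page *p)
{
    return (p->gfp & MODEL_GFP_MOVABLE) != 0;
}

/* Usage: same zone, different movability. */
void demo_movability(void)
{
    struct model_page k = alloc_kernel_context();
    struct model_page u = alloc_user_context();

    assert(page_in_exmem(&k) && page_in_exmem(&u));
    assert(!page_is_movable(&k));  /* kernel context stays put */
    assert(page_is_movable(&u));   /* user context may be migrated */
}
```

This is the distinction from ZONE_MOVABLE we have in mind: the zone itself does not imply movability, the caller's flags do.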

>
>~Gregory


