>On Fri, Mar 31, 2023 at 08:34:17PM +0900, Kyungsan Kim wrote:
>> Hi Gregory Price.
>> Thank you for joining this topic and share your viewpoint.
>> I'm sorry for late reply due to some major tasks of our team this week.
>>
>> >On Fri, Mar 24, 2023 at 05:48:08PM +0900, Kyungsan Kim wrote:
>> >>
>> >> Indeed, we tried the approach. It was able to allocate a kernel context from ZONE_MOVABLE using GFP_MOVABLE.
>> >> However, we think it would be a bad practice for the 2 reasons.
>> >> 1. It causes oops and system hang occasionally due to kernel page migration while swap or compaction.
>> >> 2. Literally, the design intention of ZONE_MOVABLE is to a page movable. So, we thought allocating a kernel context from the zone hurts the intention.
>> >>
>> >> Allocating a kernel context out of ZONE_EXMEM is unmovable.
>> >> a kernel context - alloc_pages(GFP_EXMEM,)
>> >
>> >What is the specific use case of this? If the answer is flexibility in
>> >low-memory situations, why wouldn't the kernel simply change to free up
>> >ZONE_NORMAL (swapping user memory, migrating user memory, etc) and
>> >allocate as needed?
>> >
>> >I could see allocating kernel memory from local memory expanders
>> >(directly attached to local CXL port), but I can't think of a case where
>> >it would be preferable for kernel resources to live on remote memory.
>>
>> We have thought kernelspace memory tiering cases.
>> What memory tiering we assumes is to locate a hot data in fast memory and a cold data in slow memory.
>> We think zswap, pagecache, and Meta TPP(page promotion/demotion among nodes) as the kernelspace memory tiering cases.
>>
>
>So, to clarify, when you say "kernel space memory tiering cases", do you
>mean "to support a kernel-space controlled memory tiering service" or do
>you mean "tiering of kernel memory"?

Actually, both. Borrowing your expression :), we mean a "kernel-space controlled memory tiering service that tiers kernel memory".
For example, during a zswap operation (= a kernel-space memory tiering case) on a vanilla kernel, when a user page in CXL DRAM is swapped out, the zbud allocator of zswap can allocate the zswap page from DDR DRAM (= tiering of kernel memory).
We think this is odd, because the swapped-out page is effectively promoted from CXL DRAM (far memory) to DDR DRAM (near memory).

>Because if it's the former, rather than a new zone, it seems like a
>better proposal would be to extend the numa system to add additional
>"cost/feature" attributes, rather than modifying the zone of the memory
>blocks backing the node.
>
>Note that memory zones can apply to individual blocks within a node, and
>not the entire node uniformly. So when making tiering decisions, it
>seems more expedient to investigate a node rather than a block.
>
>
>> >Since local memory expanders are static devices, there shouldn't be a
>> >great need for hotplug, which means the memory could be mapped
>> >ZONE_NORMAL without issue.
>> >
>>
>> IMHO, we think hot-add/remove is one of the key feature of CXL due to the composability aspect.
>> Right now, CXL device and system connection is limited.
>> But industry is preparing a CXL capable system that allows more than 10 CXL channels at backplane, pluggable with EDSFF.
>> Not only that, along with the progress of CXL topology - from direct-attached to switch, multi-level switch, and fabric connection -
>> I think the hot-add/remove usecase would become more important.
>>
>>
>
>Hot add/remove is somewhat fairly represented by ZONE_MOVABLE. What's I
>think confusing many people is that creating a new zone that's intended
>to be hot-pluggable *and* usable by kernel for kernel-resources/memory
>are presently exclusive operations.
>
>The underlying question is what situation is being hit in which kernel
>memory wants to be located in ZONE_MOVABLE/ZONE_EXMEM that cannot simply
>be serviced by demoting other, movable memory to these regions.
>
>The concept being that kernel allocations are a higher-priority
>allocation than userland, and as such should have priority in DRAM.
>
>For example - there is at least one paper that examined the cost of
>placing page tables on CXL Memory Expansion (on the local CXL complex,
>not remote) and found the cost is significant. Page tables are likely
>the single largest allocation the kernel will make to service large
>memory structures, so the answer to this problem is not necessarily to
>place that memory in CXL as well, but to use larger page sizes (which is
>less wasteful as memory usage is high and memory is abundant).
>
>I just don't understand what kernel resources would meet the following
>attributes:
>
>1) Do not have major system performance impacts in high-latency memory
>2) Are sufficiently large to warrant tiering
>and
>3) Are capable of being moved (i.e. no pinned areas, no dma areas, etc)
>

I agree that every level of the page table should be in near memory.
In general, data that needs to be handled quickly, such as indexes, prefers near memory.
Far memory suits data that is less user-interactive and less latency-sensitive.
Basically, our approach takes the memory provider's stance, not the memory consumer's stance.

>> >> Allocating a user context out of ZONE_EXMEM is movable.
>> >> a user context - mmap(,,MAP_EXMEM,) - syscall - alloc_pages(GFP_EXMEM | GFP_MOVABLE,)
>> >> This is how ZONE_EXMEM supports the two cases.
>> >>
>
>So if MAP_EXMEM is not used, EXMEM would not be used?
>
>That seems counter intuitive. If an allocation via mmap would be
>eligible for ZONE_MOVABLE, why wouldn't it be eligible for ZONE_EXMEM?
>
>I believe this is another reason why some folks are confused what the
>distinction between MOVABLE and EXMEM are. They seem to ultimately
>reduce to whether the memory can be moved.

Not really. We intend EXMEM to be usable both implicitly and explicitly. Please refer to the answer below.
>
>> >
>> >Is it intended for a user to explicitly request MAP_EXMEM for it to get
>> >used at all? As in, if i simply mmap() without MAP_EXMEM, will it
>> >remain unutilized?
>>
>> Our intention is to allow below 3 cases
>> 1. Explicit DDR allocation - mmap(,,MAP_NORMAL,)
>> : allocation from ZONE_NORMAL or ZONE_MOVABLE, or allocation fails.
>> 2. Explicit CXL allocation - mmap(,,MAP_EXMEM,)
>> : allocation from ZONE_EXMEM, of allocation fails.
>> 3. Implicit Memory allocation - mmap(,,,)
>> : allocation from ZONE_NORMAL, ZONE_MOVABLE, or ZONE_EXMEM. In other words, no matter where DDR or CXL DRAM.
>>
>> Among that, 3 is similar with vanilla kernel operation in that the allocation request traverses among multiple zones or nodes.
>> We think it would be good or bad for the mmap caller point of view.
>> It is good because memory is allocated, while it could be bad because the caller does not have idea of allocated memory type.
>> The later would hurt QoS metrics or userspace memory tiering operation, which expects near/far memory.
>>
>
>For what it's worth, mmap is not the correct api for userland to provide
>kernel hints on data placement. That would be madvise and friends.

Yes, our key intention is to provide userland with a way to pass such hints: not only mmap(), but also mbind(), set_mempolicy(), madvise(), etc.

>
>But further, allocation of memory from userland must be ok with having
>its memory moved/swapped/whatever unless additional assistance from the
>kernel is provided (page pinning, mlock, whatever) to ensure it will
>not be moved. Presumably this is done to ensure the kernel can make
>runtime adjustments to protect itself from being denied memory and
>causing instability and/or full system faults.

Yes. In the case of implicit allocation, our proposal is fully compatible with the vanilla Linux MM.
Our thought is to provide both explicit and implicit ways.
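
To make the three cases above (explicit DDR, explicit CXL, implicit) more concrete, here is a rough user-space sketch in C. It is only an illustration, not kernel code and not our patch: the enum names, the pick_zone() helper, and the free_pages[] array are all hypothetical, and the near-to-far fallback order used for the implicit case is an assumption made for the example.

```c
#include <stddef.h>

/* Hypothetical zone identifiers, ordered from near (DDR) to far (CXL). */
enum zone { ZONE_NORMAL, ZONE_MOVABLE, ZONE_EXMEM, ZONE_COUNT, ZONE_NONE = -1 };

/* Hypothetical stand-ins for the three mmap() cases discussed above. */
enum policy { POLICY_IMPLICIT, POLICY_EXPLICIT_DDR, POLICY_EXPLICIT_CXL };

/*
 * Return the first zone permitted by the policy that still has free pages,
 * or ZONE_NONE if the allocation would fail. free_pages[] is indexed by zone.
 * The near-to-far order for the implicit case is an assumption.
 */
static enum zone pick_zone(enum policy p, const size_t free_pages[ZONE_COUNT])
{
    static const enum zone ddr_order[]      = { ZONE_NORMAL, ZONE_MOVABLE };
    static const enum zone cxl_order[]      = { ZONE_EXMEM };
    static const enum zone implicit_order[] = { ZONE_NORMAL, ZONE_MOVABLE, ZONE_EXMEM };
    const enum zone *order;
    size_t n;

    switch (p) {
    case POLICY_EXPLICIT_DDR: /* case 1: DDR zones only, else fail */
        order = ddr_order;
        n = sizeof(ddr_order) / sizeof(ddr_order[0]);
        break;
    case POLICY_EXPLICIT_CXL: /* case 2: ZONE_EXMEM only, else fail */
        order = cxl_order;
        n = sizeof(cxl_order) / sizeof(cxl_order[0]);
        break;
    default:                  /* case 3: any zone, DDR or CXL */
        order = implicit_order;
        n = sizeof(implicit_order) / sizeof(implicit_order[0]);
        break;
    }

    for (size_t i = 0; i < n; i++)
        if (free_pages[order[i]] > 0)
            return order[i];

    return ZONE_NONE;
}
```

With free_pages = {0, 0, 8} (DDR exhausted, CXL available), the explicit DDR request fails with ZONE_NONE, while both the explicit CXL request and the implicit request land in ZONE_EXMEM. Only in the implicit case does the caller not know in advance which memory type it will get.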
>
>
>I think you need to clarify your intents for this zone, in particular
>your intent for exactly what data can and cannot live in this zone and
>the reasons for this. "To assist kernel tiering operations" is very
>vague and not a description of what memory is and is not allowed in the
>zone.

We do not restrict what data can live in ZONE_EXMEM.
Our intention is to allow both movable and unmovable allocations from kernel and user contexts. Also, the allocation context can determine the movability.
In other words, ZONE_EXMEM is not intended to confine use cases, but to provide ways to run a use case on CXL DRAM.

>
>~Gregory