Hello all, I appreciate all the feedback and questions during my session on 5/8 at 13:00 PDT.
For those who are interested, please find my slides at [2].
My apologies for failing to manage the time slot, which meant I skipped some of the prepared content.
The Program Committee has kindly allowed me a make-up session of a few extra minutes around 5/10 15:30 PST, after "MM process: Akpm". Please find the schedule at [1]. Thank you, Dan Williams and Michal Hocko.

The remaining dialog I keep in mind is
- further sync-up of CXL requirements with the kernel
- what ZONE_EXMEM does for those requirements
- quick answers to the feedback I missed on 5/8
- alignment with kernel direction

[1] https://github.com/OpenMPDK/SMDK/wiki/93.-%5BLSF-MM-BPF-TOPIC%5D-SMDK-inspired-MM-changes-for-CXL
[2] https://docs.google.com/spreadsheets/d/1tIDYHgLhhcetoXtgyvcoM6YZWWHcVLdNYipBq2dH-_k/edit#gid=0

On Fri, Apr 14, 2023 at 5:45 PM Kyungsan Kim <ks0204.kim@xxxxxxxxxxx> wrote:
>
> >CXL is a promising technology that leads to fundamental changes in computing architecture.
> >To facilitate the adoption and spread of CXL memory, we are developing a memory tiering solution called SMDK [1][2].
> >Using SMDK and a CXL RAM device, our team has been working with industry and academic partners over the past year.
> >Also, thanks to many researchers' efforts, the CXL adoption stage is gradually moving forward from basic enablement to real-world composite use cases.
> >At this point, based on the research and experience gained working on SMDK, we would like to suggest a session at LSF/MM/BPF this year
> >to propose possible Linux MM changes, with a brief overview of SMDK.
> >
> >Adam Manzanares kindly advised me that LSF/MM/BPF prefers discussing implementation details for a given problem and consensus.
> >Considering the adoption stage of CXL technology, however, let me suggest a design-level discussion of the two MM expansions in SMDK this year.
> >When we reach design consensus with participants, we hope to continue with follow-up discussions covering additional implementation details.
> >
> >
> >1. A new zone, ZONE_EXMEM
> >We added ZONE_EXMEM to manage CXL RAM device(s), separate from ZONE_NORMAL, which manages ordinary DRAM, for the three reasons below.
> >
> >1) CXL RAM has many characteristics that differ from conventional DRAM, because a CXL device inherits and extends the PCIe specification.
> >ex) frequency range, pluggability, link speed/width negotiation, host/device flow control, power throttling, channel-interleaving methodology, error handling, etc.
> >It is likely that the primary use case of CXL RAM will be system RAM.
> >However, to deal with the hardware differences properly, different MM algorithms are needed accordingly.
> >
> >2) Historically, zones have been added to reflect the evolution of CPU, I/O, and memory devices.
> >ex) ZONE_DMA(32), ZONE_HIGHMEM, ZONE_DEVICE, and ZONE_MOVABLE.
> >Each zone applies different MM algorithms, such as page reclaim, compaction, migration, and fragmentation handling.
> >At first, we tried to reuse the existing zones ZONE_DEVICE and ZONE_MOVABLE for CXL RAM.
> >However, the purpose and implementation of those zones do not fit CXL RAM.
> >
> >3) The industry is preparing CXL-capable systems that connect dozens of CXL devices in a single server.
> >When each CXL device becomes a separate node, an administrator/programmer needs to be aware of and manually control all the nodes using third-party software such as numactl and libnuma.
> >ZONE_EXMEM assembles CXL RAM devices into a single zone and provides an abstraction to userspace by managing the devices seamlessly.
> >Also, the zone is able to interleave the assembled devices in software to achieve aggregated bandwidth.
> >We would like to discuss whether this can coexist with HW interleaving, analogous to SW/HW RAID-0.
> >To aid understanding, please refer to the node partition part of the diagram [3].
> >
> >
> >2. User/Kernelspace Programmable Interface
> >A memory tiering solution typically attempts to place hot data in near memory and cold data in far memory as accurately as possible [4][5][6][7].
> >We noticed that the hotness/coldness of data is determined by the memory access pattern of the running application and/or kernel context.
> >Hence, a running context needs an identifier to distinguish near memory from far memory.
> >When CXL RAM is managed as a NUMA node, a node id can more or less function as a CXL identifier.
> >However, a node id is limited in that it is ephemeral information that varies dynamically with the online status of the CXL topology and system sockets.
> >For this reason, we provide programmable interfaces for userspace and kernelspace contexts to explicitly (de)allocate memory from DRAM and CXL RAM regardless of system changes.
> >Specifically, the MAP_EXMEM and GFP_EXMEM flags were added to the mmap() syscall and the kmalloc() family, respectively.
> >
> >Thanks to Adam Manzanares for reviewing this CFP thoroughly.
> >
> >
> >[1] SMDK: https://github.com/openMPDK/SMDK
> >[2] SMT: Software-defined Memory Tiering for Heterogeneous Computing Systems with CXL Memory Expander, https://ieeexplore.ieee.org/document/10032695
> >[3] SMDK node partition: https://github.com/OpenMPDK/SMDK/wiki/2.-SMDK-Architecture#memory-partition
> >[4] TMO: Transparent Memory Offloading in Datacenters, https://dl.acm.org/doi/10.1145/3503222.3507731
> >[5] TPP: Transparent Page Placement for CXL-Enabled Tiered Memory, https://arxiv.org/abs/2206.02878
> >[6] Pond: CXL-Based Memory Pooling Systems for Cloud Platforms, https://dl.acm.org/doi/10.1145/3575693.3578835
> >[7] Hierarchical NUMA: https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4656/original/Hierarchical_NUMA_Design_Plumbers_2017.pdf
>
> Let us restate the original CFP from a requirements point of view, along with our thoughts on each.
>
> 1) CXL DRAM pluggability
> Issue: a single random unmovable allocation makes a CXL DRAM device unpluggable.
> It can happen from userspace (e.g. pinning a DMA buffer) or kernelspace (e.g. pinning metadata such as struct page, zone, etc.).
> For this matter, we should separate logical memory on/offline from physical add/remove.
> Thought: it should be possible to use a CXL DRAM device in a selective manner, either pluggable or unpluggable.
> But please don't get this wrong: the two modes are mutually exclusive, so they cannot coexist on a single CXL DRAM channel.
>
> 2) CXL DRAM identifier (API and ABI)
> Issue: a user/kernel context has to use the node id of a CXL memory node to access CXL DRAM, explicitly or implicitly.
> Thought: a node id is ephemeral information. Userspace and kernelspace memory tiering solutions need an API and/or ABI rather than a node id.
>
> 3) Prevention of unintended CXL page migration
> Issue: during zswap operation, a page in near memory (DIMM DRAM) is allocated to store a page swapped out from far memory (CXL DRAM).
> Thought: on the swap path, far memory should not be accidentally promoted to near memory.
>
> 4) Too many CXL nodes appearing in userland
> Issue: many CXL memory nodes will appear to userland as CXL-capable servers, switches, and fabric topologies develop.
> Currently, to achieve aggregated bandwidth across the CXL nodes, userland needs to be aware of the nodes and manage them using third-party software such as numactl and libnuma.
> Thought: the kernel should provide an abstraction layer so that userland can deal with this seamlessly.
> Note that, traditionally, a node implies multiple memory channels at the same distance, and a node is the largest management unit in MM, i.e. node - zone - page.
> So we think multiple CXL DRAM devices can appear as a single node, and the management unit for a single CXL DRAM device should be smaller than a node.
> --
> ------------------------------------------------------------
> the person who practices a truth goes toward light.