Re: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL

Hi, Kyungsan,

Kyungsan Kim <ks0204.kim@xxxxxxxxxxx> writes:

> CXL is a promising technology that is leading to fundamental changes in computing architecture.
> To facilitate the adoption and spread of CXL memory, we have been developing a memory tiering solution called SMDK[1][2].
> Using SMDK and CXL RAM devices, our team has been working with industry and academic partners over the last year.
> Also, thanks to many researchers' efforts, CXL adoption is gradually moving forward from basic enablement to real-world composite use cases.
> At this point, based on the research and experience gained working on SMDK, we would like to propose a session at LSF/MM/BPF this year
> to discuss possible Linux MM changes, along with a brief overview of SMDK.
>
> Adam Manzanares kindly advised me that LSF/MM/BPF discussions preferably cover implementation details for a given problem on which consensus already exists.
> Considering the current adoption stage of CXL technology, however, let me suggest a design-level discussion of the two MM expansions of SMDK this year.
> Once we reach design consensus with participants, we hope to continue with follow-up discussions on implementation details.
>
>  
> 1. A new zone, ZONE_EXMEM
> We added ZONE_EXMEM to manage CXL RAM device(s) separately from ZONE_NORMAL, which serves conventional DRAM, for the three reasons below.
>
> 1) A CXL RAM device has many characteristics that differ from conventional DRAM, because a CXL device inherits and extends the PCIe specification,
> e.g. frequency range, pluggability, link speed/width negotiation, host/device flow control, power throttling, channel-interleaving methodology, error handling, etc.
> The primary use case of CXL RAM is likely to be system RAM.
> However, to handle these hardware differences properly, different MM algorithms are needed.
>
> 2) Historically, zones have been added to reflect the evolution of CPU, IO, and memory devices,
> e.g. ZONE_DMA(32), ZONE_HIGHMEM, ZONE_DEVICE, and ZONE_MOVABLE.
> Each zone applies different MM algorithms for page reclaim, compaction, migration, and fragmentation handling.
> At first, we tried to reuse the existing zones ZONE_DEVICE and ZONE_MOVABLE for CXL RAM.
> However, the purpose and implementation of those zones do not fit CXL RAM.
>
> 3) Industry is preparing CXL-capable systems that connect dozens of CXL devices to a single server.
> When each CXL device becomes a separate NUMA node, an administrator/programmer needs to be aware of, and manually control, all the nodes using third-party software such as numactl and libnuma.
> ZONE_EXMEM allows CXL RAM devices to be assembled into a single zone and provides an abstraction to userspace by managing the devices seamlessly.
> The zone is also able to interleave the assembled devices in software to provide aggregated bandwidth.
> We would like to discuss whether this can coexist with HW interleaving, analogous to SW/HW RAID0.
> To aid understanding, please refer to the node partition part of the diagram[3]; a minimal sketch of the zone addition follows below.
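>
> For illustration only, here is a minimal sketch of where such a zone could sit in
> enum zone_type (the CONFIG_ZONE_EXMEM symbol and its placement are assumptions of
> this sketch, not the final SMDK patch):
>
>   /* include/linux/mmzone.h (illustrative sketch only) */
>   enum zone_type {
>   #ifdef CONFIG_ZONE_DMA
>           ZONE_DMA,
>   #endif
>   #ifdef CONFIG_ZONE_DMA32
>           ZONE_DMA32,
>   #endif
>           ZONE_NORMAL,
>   #ifdef CONFIG_HIGHMEM
>           ZONE_HIGHMEM,
>   #endif
>   #ifdef CONFIG_ZONE_EXMEM
>           ZONE_EXMEM,     /* CXL RAM device(s), managed separately from DRAM */
>   #endif
>           ZONE_MOVABLE,
>   #ifdef CONFIG_ZONE_DEVICE
>           ZONE_DEVICE,
>   #endif
>           __MAX_NR_ZONES
>   };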

In addition to CXL memory, we may have other kinds of memory in the
system, for example, HBM (High Bandwidth Memory), memory on FPGA cards,
memory on GPU cards, etc.  I guess that we need to consider them
together.  Do we need to add one zone type for each kind of memory?

>
> 2. User/Kernelspace Programmable Interface
> A memory tiering solution typically attempts to place hot data in near memory and cold data in far memory as accurately as possible[4][5][6][7].
> We noticed that the hotness/coldness of data is determined by the memory access pattern of the running application and/or kernel context.
> Hence, a running context needs an identifier to distinguish near memory from far memory.
> When CXL RAM device(s) are managed as NUMA nodes, a node id can more or less function as a CXL identifier.
> However, a node id is limited in that it is ephemeral information that varies dynamically with the online status of the CXL topology and system sockets.
> For this reason, we provide programmable interfaces that allow userspace and kernelspace contexts to explicitly (de)allocate memory from DRAM or CXL RAM regardless of system changes.
> Specifically, the MAP_EXMEM and GFP_EXMEM flags were added to the mmap() syscall and the kmalloc() family, respectively; see the usage sketch below.
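>
> As an illustration, a minimal usage sketch follows. Only the flag names come from
> SMDK; the surrounding flag composition is an assumption of this sketch:
>
>   /* userspace context: explicitly allocate from CXL RAM via mmap()
>    * (MAP_EXMEM is an SMDK-specific flag, not an upstream one) */
>   #include <sys/mman.h>
>
>   int main(void)
>   {
>           void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
>                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_EXMEM, -1, 0);
>
>           return p == MAP_FAILED;
>   }
>
>   /* kernelspace context (separate code base): explicitly allocate
>    * from CXL RAM via the kmalloc() family (GFP_EXMEM is SMDK-specific) */
>   #include <linux/slab.h>
>
>   static int example_exmem_alloc(void)
>   {
>           void *buf = kmalloc(4096, GFP_KERNEL | GFP_EXMEM);
>
>           if (!buf)
>                   return -ENOMEM;
>           kfree(buf);
>           return 0;
>   }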

In addition to NUMA nodes, we have defined the following interfaces to
expose information about the different kinds of memory in the system.

https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html#abi-sys-devices-virtual-memory-tiering
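For example, the NUMA nodes that belong to each memory tier can be read
from the "nodelist" file under that directory.  A minimal userspace
sketch (tier IDs are assigned at runtime, so "memory_tier4" below is
only an example):

  #include <stdio.h>

  int main(void)
  {
          char buf[256];
          FILE *f = fopen("/sys/devices/virtual/memory_tiering/"
                          "memory_tier4/nodelist", "r");

          /* print the NUMA node list of this tier, e.g. "0-1" */
          if (f && fgets(buf, sizeof(buf), f))
                  printf("memory_tier4 nodes: %s", buf);
          if (f)
                  fclose(f);
          return 0;
  }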

Best Regards,
Huang, Ying

> Thanks to Adam Manzanares for reviewing this CFP thoroughly.
>
>
> [1] SMDK: https://github.com/openMPDK/SMDK
> [2] SMT: Software-defined Memory Tiering for Heterogeneous Computing Systems with CXL Memory Expander, https://ieeexplore.ieee.org/document/10032695
> [3] SMDK node partition: https://github.com/OpenMPDK/SMDK/wiki/2.-SMDK-Architecture#memory-partition
> [4] TMO: Transparent Memory Offloading in Datacenters, https://dl.acm.org/doi/10.1145/3503222.3507731
> [5] TPP: Transparent Page Placement for CXL-Enabled Tiered Memory, https://arxiv.org/abs/2206.02878
> [6] Pond: CXL-Based Memory Pooling Systems for Cloud Platforms, https://dl.acm.org/doi/10.1145/3575693.3578835
> [7] Hierarchical NUMA: https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4656/original/Hierarchical_NUMA_Design_Plumbers_2017.pdf


