RE: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL

Hi Huang Ying,

I apologize for the late reply; I was tied up with a personal schedule.
Thank you for sharing your viewpoint and the information.


>Hi, Kyungsan,
>
>Kyungsan Kim <ks0204.kim@xxxxxxxxxxx> writes:
>
>> CXL is a promising technology that leads to fundamental changes in computing architecture.
>> To facilitate the adoption and widespread use of CXL memory, we are developing a memory tiering solution called SMDK[1][2].
>> Using SMDK and a CXL RAM device, our team has been working with industry and academic partners over the last year.
>> Also, thanks to many researchers' efforts, the CXL adoption stage is gradually moving forward from basic enablement to real-world composite use cases.
>> At this moment, based on the research and experience gained working on SMDK, we would like to suggest a session at LSF/MM/BPF this year
>> to propose possible Linux MM changes with a brief overview of SMDK.
>>
>> Adam Manzanares kindly advised me that discussing implementation details of a given problem, where consensus exists, is preferred at LSF/MM/BPF.
>> Considering the adoption stage of CXL technology, however, let me suggest a design-level discussion on the two MM expansions of SMDK this year.
>> Once we reach a design consensus with participants, we hope to continue with follow-up discussions on additional implementation details.
>>
>> 
>> 1. A new zone, ZONE_EXMEM
>> We added ZONE_EXMEM to manage CXL RAM device(s) separately from ZONE_NORMAL, which serves conventional DRAM, for the three reasons below.
>>
>> 1) CXL RAM has many characteristics that differ from conventional DRAM, because a CXL device inherits and extends the PCIe specification.
>> ex) frequency range, pluggability, link speed/width negotiation, host/device flow control, power throttling, channel-interleaving methodology, error handling, etc.
>> It is likely that the primary use case of CXL RAM will be System RAM.
>> However, to deal with the hardware differences properly, different MM algorithms are needed.
>>
>> 2) Historically, zones have been expanded to reflect the evolution of CPU, IO, and memory devices.
>> ex) ZONE_DMA(32), ZONE_HIGHMEM, ZONE_DEVICE, and ZONE_MOVABLE.
>> Each zone applies different MM algorithms such as page reclaim, compaction, migration, and fragmentation.
>> At first, we tried to reuse the existing zones ZONE_DEVICE and ZONE_MOVABLE for CXL RAM.
>> However, the purpose and implementation of those zones do not fit CXL RAM.
>>
>> 3) Industry is preparing CXL-capable systems that connect dozens of CXL devices in a server system.
>> When each CXL device becomes a separate node, an administrator/programmer needs to be aware of, and manually control, all the nodes using third-party software such as numactl and libnuma.
>> ZONE_EXMEM allows assembling CXL RAM devices into a single ZONE_EXMEM zone, and provides an abstraction to userspace by seamlessly managing the devices.
>> Also, the zone is able to interleave the assembled devices in software to achieve aggregated bandwidth.
>> We would like to discuss whether this software interleaving can coexist with HW interleaving, analogous to SW/HW RAID0.
>> To help understanding, please refer to the node partition part of the picture[3].
>
>In addition to CXL memory, we may have other kind of memory in the
>system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
>memory in GPU card, etc.  I guess that we need to consider them
>together.  Do we need to add one zone type for each kind of memory?

We also don't think a new zone is needed for every single kind of memory device.
Our viewpoint is that ZONE_NORMAL alone is not enough to manage multiple volatile memory devices, given the growing number of device types.
Including CXL DRAM, we think ZONE_EXMEM can be used to represent the extended volatile memories that have different HW characteristics.
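
To make this concrete, here is a minimal sketch of how such a zone could be introduced, assuming the upstream zone infrastructure; the CONFIG_EXMEM guard and the enum position are illustrative placeholders, not the final SMDK implementation.

    /*
     * Sketch: ZONE_EXMEM added to enum zone_type in include/linux/mmzone.h.
     * The guard and position are illustrative only.
     */
    enum zone_type {
    #ifdef CONFIG_ZONE_DMA
            ZONE_DMA,
    #endif
    #ifdef CONFIG_ZONE_DMA32
            ZONE_DMA32,
    #endif
            ZONE_NORMAL,            /* conventional DRAM */
    #ifdef CONFIG_HIGHMEM
            ZONE_HIGHMEM,
    #endif
    #ifdef CONFIG_EXMEM
            ZONE_EXMEM,             /* extended volatile memory, e.g. CXL RAM */
    #endif
            ZONE_MOVABLE,
    #ifdef CONFIG_ZONE_DEVICE
            ZONE_DEVICE,
    #endif
            __MAX_NR_ZONES
    };

A matching __GFP_EXMEM bit would also be needed so that gfp_zone() can route allocations to the new zone; that plumbing is omitted here.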
 
>
>>
>> 2. User/Kernelspace Programmable Interface
>> In terms of a memory tiering solution, it is typical that the solution attempts to place hot data in near memory and cold data in far memory as accurately as possible.[4][5][6][7]
>> We noticed that the hotness/coldness of data is determined by the memory access pattern of the running application and/or kernel context.
>> Hence, a running context needs an identifier to distinguish near memory from far memory.
>> When CXL RAM(s) is managed as a NUMA node, a node id can function as a CXL identifier to some extent.
>> However, a node id has the limitation that it is ephemeral information, varying dynamically with the online status of the CXL topology and system sockets.
>> In this sense, we provide programmable interfaces for userspace and kernelspace contexts to explicitly (de)allocate memory from DRAM and CXL RAM regardless of such system changes.
>> Specifically, the MAP_EXMEM and GFP_EXMEM flags were added to the mmap() syscall and the kmalloc() family, respectively.
>
>In addition to NUMA node, we have defined the following interfaces to
>expose information about different kind of memory in the system.
>
>https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html#abi-sys-devices-virtual-memory-tiering
>
>Best Regards,
>Huang, Ying

The sysfs interface looks useful for prioritizing groups of fast/slow memory nodes using lists of node ids.
We would say it can work together with the programmable interfaces we suggested, as in the diagram and sketch below.

               User/Kernel context (MAP_EXMEM/GFP_EXMEM)
                                    |
                 +------------------+-------------------+
                 |                                      |
[sysfs/memory_tier0 - DDR Node list]   [sysfs/memory_tier1 - CXL Node list]
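
For illustration, here is a minimal sketch of the userspace path, assuming the SMDK-proposed MAP_EXMEM flag; the flag value below is a placeholder rather than upstream uapi, and a mainline kernel would simply ignore the unknown bit, so this only shows the intended call shape.

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MAP_EXMEM
    #define MAP_EXMEM 0x200000      /* placeholder bit, not upstream uapi */
    #endif

    int main(void)
    {
            size_t len = 2UL << 20; /* 2 MiB */
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_EXMEM, -1, 0);

            if (p == MAP_FAILED) {
                    perror("mmap(MAP_EXMEM)");
                    return 1;
            }
            memset(p, 0, len);      /* touch pages; SMDK would back them with CXL RAM */
            munmap(p, len);
            return 0;
    }

On the kernel side, the analogous call would be kmalloc(size, GFP_EXMEM), assuming GFP_EXMEM is a gfp bit that gfp_zone() maps onto ZONE_EXMEM.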

>
>> Thanks to Adam Manzanares for reviewing this CFP thoroughly.
>>
>>
>> [1] SMDK: https://github.com/openMPDK/SMDK
>> [2] SMT: Software-defined Memory Tiering for Heterogeneous Computing Systems with CXL Memory Expander, https://ieeexplore.ieee.org/document/10032695
>> [3] SMDK node partition: https://github.com/OpenMPDK/SMDK/wiki/2.-SMDK-Architecture#memory-partition
>> [4] TMO: Transparent Memory Offloading in Datacenters, https://dl.acm.org/doi/10.1145/3503222.3507731
>> [5] TPP: Transparent Page Placement for CXL-Enabled Tiered Memory, https://arxiv.org/abs/2206.02878
>> [6] Pond: CXL-Based Memory Pooling Systems for Cloud Platforms, https://dl.acm.org/doi/10.1145/3575693.3578835
>> [7] Hierarchical NUMA: https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4656/original/Hierarchical_NUMA_Design_Plumbers_2017.pdf


