On Fri, Jun 18, 2021 at 10:50 AM Wei Xu <weixugc@xxxxxxxxxx> wrote:
>
> In this proposal, I'd like to discuss userspace-managed memory tiering
> and the kernel support that it needs.
>
> New memory technologies and interconnect standards make it possible to
> have memory with different performance and cost on the same machine
> (e.g. DRAM + PMEM, DRAM + cost-optimized memory attached via CXL.mem).
> We can expect heterogeneous memory systems that have performance
> implications far beyond classical NUMA to become increasingly common
> in the future. One of the important use cases of such tiered memory
> systems is to improve data center and cloud efficiency with a better
> performance/TCO ratio.
>
> Because different classes of applications (e.g. latency-sensitive vs
> latency-tolerant, high-priority vs low-priority) have different
> requirements, richer and more flexible memory tiering policies will
> be needed to achieve the desired performance targets on a tiered
> memory system. Such policies would be more effectively managed by a
> userspace agent, not by the kernel. Moreover, we (Google) are
> explicitly trying to avoid adding a ton of heuristics to enlighten
> the kernel about the policy that we want on multi-tenant machines
> when userspace offers more flexibility.
>
> To manage memory tiering in userspace, we need kernel support in
> three key areas:
>
> - resource abstraction and control of tiered memory;
> - an API to monitor page accesses for making memory tiering decisions;
> - an API to migrate pages (demotion/promotion).
>
> Userspace memory tiering can work on just NUMA memory nodes, provided
> that memory resources from different tiers are abstracted into
> separate NUMA nodes. The userspace agent can create a tiering
> topology among these nodes based on their distances.
>
> An explicit memory tiering abstraction in the kernel is preferred,
> though, because it can not only allow the kernel to react in cases
> where it is challenging for userspace (e.g. reclaim-based demotion
> when the system is under DRAM pressure due to a usage surge), but
> also enable tiering controls such as per-cgroup memory tier limits.
> This requirement is mostly aligned with the existing proposals [1]
> and [2].
>
> The userspace agent manages all migratable user memory on the system,
> and this can be transparent from the point of view of applications.
> To demote cold pages and promote hot pages, the userspace agent needs
> page access information. Because it is system-wide tiering for user
> memory, the access information for both mapped and unmapped user
> pages is needed, and so are the physical page addresses. A
> combination of page table accessed-bit scanning and struct page
> scanning will likely be needed. Such page access monitoring should
> also be efficient, because the scans can be frequent. To return the
> page-level access information to userspace, one proposal is to use
> tracepoint events. The userspace agent can then use BPF programs to
> collect such data and also apply customized filters when necessary.

Just FYI. There has been a project for a userspace daemon. Please
refer to https://github.com/fengguang/memory-optimizer

We (Alibaba, when I was there) did some preliminary tests and
benchmarks with it. The accuracy was pretty good, but the cost was
relatively high. I agree with you that efficiency is the key. BPF may
be a good approach to reducing the cost.

I'm not sure what the current status of this project is. You may
reach out to Huang Ying for more information.
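BTW, to make the BPF approach concrete, below is a minimal sketch of
the kernel-side program such an agent could load with libbpf. Note
that the tracepoint name (mm/mm_page_accessed) and its field layout
are made up for illustration, since the tracepoints proposed above
don't exist yet; the real names and fields would come from whatever
the kernel ends up exposing.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Per-pfn access counts, periodically drained by the userspace agent. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1 << 20);
	__type(key, __u64);   /* pfn */
	__type(value, __u64); /* access count */
} access_counts SEC(".maps");

/* Hypothetical tracepoint layout, assumed for illustration only. */
struct mm_page_accessed_args {
	unsigned long long pad;	/* common tracepoint fields */
	__u64 pfn;
	__u32 nid;
};

SEC("tracepoint/mm/mm_page_accessed")
int count_page_access(struct mm_page_accessed_args *ctx)
{
	__u64 pfn = ctx->pfn;
	__u64 one = 1, *cnt;

	cnt = bpf_map_lookup_elem(&access_counts, &pfn);
	if (cnt)
		__sync_fetch_and_add(cnt, 1);
	else
		bpf_map_update_elem(&access_counts, &pfn, &one, BPF_ANY);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";

The userspace side would periodically walk the access_counts map to
rank pages by hotness; keying by pfn matches the physical-address-based
tiering described above, and a filter on nid (or anything else) can be
applied right in the program before the map update.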
>
> The userspace agent can also make use of hardware PMU events, for
> which the existing kernel support should be sufficient.
>
> The third area is API support for migrating pages. The existing
> move_pages() syscall can be a candidate, though it is virtual-address
> based and cannot migrate unmapped pages. Is a physical-address-based
> variant (e.g. move_pfns()) an acceptable proposal?
>
> [1] https://lore.kernel.org/lkml/9cd0dcde-f257-1b94-17d0-f2e24a3ce979@xxxxxxxxx/
> [2] https://lore.kernel.org/patchwork/cover/1408180/
>
> Thanks,
> Wei
>
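Regarding the PMU events mentioned above, here is a hedged sketch of
how the agent might configure a sampling event that records physical
addresses via perf_event_open(). The event choice is a placeholder (a
real deployment would pick a vendor-specific precise load event, and
precise_ip may need adjusting per CPU), and parsing the samples out of
the mmap ring buffer is left out:

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_mem_sampling_event(int cpu)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CACHE_MISSES;  /* placeholder event */
	attr.sample_period = 10007;                /* ~1 sample per 10k misses */
	attr.sample_type = PERF_SAMPLE_ADDR | PERF_SAMPLE_PHYS_ADDR;
	attr.precise_ip = 2;        /* request precise sampling (PEBS/IBS) */
	attr.exclude_kernel = 1;

	/* System-wide sampling on one CPU: pid = -1. */
	return syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0);
}

PERF_SAMPLE_PHYS_ADDR is what makes this usable for tiering, since it
gives the physical address to correlate with the pfn-keyed access data.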
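And for the migration API, this is roughly what the agent can already
do today with the virtual-address-based move_pages() syscall (node 1
is assumed to be the slow tier here, which is machine-specific):

#include <numaif.h>  /* move_pages(); link with -lnuma */

/* Demote one page-aligned address in the given process. */
static int demote_page(int pid, void *addr)
{
	void *pages[1] = { addr };
	int nodes[1] = { 1 };  /* assumed slow-tier node */
	int status[1];

	if (move_pages(pid, 1, pages, nodes, status, MPOL_MF_MOVE) < 0)
		return -1;
	return status[0];  /* resulting node, or negative errno */
}

A move_pfns() variant would presumably take an array of pfns instead
of virtual addresses, which is what unmapped pages would need; its
exact shape is the open question in the proposal above.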