[LSF/MM/BPF TOPIC] Userspace managed memory tiering

Wei Xu <weixugc@xxxxxxxxxx> · Fri, 18 Jun 2021 10:50:29 -0700

In this proposal, I'd like to discuss userspace-managed memory tiering
and the kernel support that it needs.

New memory technologies and interconnect standards make it possible to
have memory with different performance and cost on the same machine
(e.g. DRAM + PMEM, DRAM + cost-optimized memory attached via CXL.mem).
We can expect heterogeneous memory systems that have performance
implications far beyond classical NUMA to become increasingly common
in the future.  One important use case of such tiered memory systems
is to improve data center and cloud efficiency with better
performance/TCO.

Different classes of applications (e.g. latency sensitive vs
latency tolerant, high priority vs low priority) have different
requirements, so richer and more flexible memory tiering policies will
be needed to achieve the desired performance target on a tiered
memory system.  Such policies would be more effectively managed by a
userspace agent than by the kernel.  Moreover, we (Google) are
explicitly trying to avoid adding a ton of heuristics to enlighten the
kernel about the policy that we want on multi-tenant machines when
userspace offers more flexibility.

To manage memory tiering in userspace, we need kernel support in
three key areas:

- resource abstraction and control of tiered memory;
- API to monitor page accesses for making memory tiering decisions;
- API to migrate pages (demotion/promotion).

Userspace memory tiering can work on just NUMA memory nodes, provided
that memory resources from different tiers are abstracted into
separate NUMA nodes.  The userspace agent can create a tiering
topology among these nodes based on their distances.
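
As a rough illustration of the distance-based approach, the sketch
below groups NUMA nodes into tiers by their distance to the nearest
node that has CPUs.  The sysfs files it reads (nodeN/distance and
nodeN/cpulist) exist today, but the grouping rule, the function names,
and the assumption that node IDs are dense (0..N-1) are illustrative
only and not part of this proposal.

```python
from pathlib import Path

NODE_ROOT = Path("/sys/devices/system/node")


def parse_distance_row(text):
    """Parse one nodeN/distance file, e.g. "10 21 17 28"."""
    return [int(tok) for tok in text.split()]


def read_topology(root=NODE_ROOT):
    """Read the distance matrix and the set of nodes that have CPUs.

    Assumes node IDs are dense, so row/column i corresponds to node i.
    """
    ids = sorted(int(p.name[4:]) for p in root.glob("node[0-9]*"))
    distance = [
        parse_distance_row((root / f"node{i}" / "distance").read_text())
        for i in ids
    ]
    # A memory-only node has an empty cpulist.
    cpu_nodes = {
        i for i in ids
        if (root / f"node{i}" / "cpulist").read_text().strip()
    }
    return distance, cpu_nodes


def build_tiers(distance, cpu_nodes):
    """Group nodes by distance to the nearest CPU node (hypothetical rule).

    Returns a list of tiers, fastest first; each tier is a list of node IDs.
    """
    if not cpu_nodes:
        raise ValueError("need at least one node with CPUs")
    by_distance = {}
    for node in range(len(distance)):
        nearest = min(distance[cpu][node] for cpu in cpu_nodes)
        by_distance.setdefault(nearest, []).append(node)
    return [by_distance[d] for d in sorted(by_distance)]
```

On a real machine the agent would call
build_tiers(*read_topology()); nodes with CPUs land in the first tier
because their local distance is the smallest value in the matrix.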

An explicit memory tiering abstraction in the kernel is preferred,
though.  It not only allows the kernel to react in cases that are
challenging for userspace (e.g. reclaim-based demotion when the system
is under DRAM pressure due to a usage surge), but also enables tiering
controls such as per-cgroup memory tier limits.  This requirement is
mostly aligned with the existing proposals [1] and [2].

The userspace agent manages all migratable user memory on the system,
and this can be transparent to applications.  To demote cold pages and
promote hot pages, the userspace agent needs page access information.
Because tiering is system-wide for user memory, the access information
for both mapped and unmapped user pages is needed, and so are the
physical page addresses.  A combination of page table accessed-bit
scanning and struct page scanning will likely be needed.  Such page
access monitoring must also be efficient because the scans can be
frequent.  To return the page-level access information to userspace,
one proposal is to use tracepoint events.  The userspace agent can
then use BPF programs to collect such data and also apply customized
filters when necessary.
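
To make the intended use of this data concrete, here is a minimal
sketch of how an agent might turn per-scan access information into
promotion and demotion candidates.  The input format (one set of
accessed PFNs per scan interval, plus a PFN-to-tier map), the
thresholds, and the function name are all hypothetical; the actual
event format is exactly what is up for discussion.

```python
from collections import Counter


def pick_candidates(scans, page_tier, hot_threshold, cold_threshold):
    """Choose pages to promote and demote from recent scan results.

    scans:      list of sets; each set holds the PFNs seen accessed in
                one scan interval (hypothetical format).
    page_tier:  dict mapping PFN -> tier index, 0 being the fastest.
    Returns (promote, demote) as sorted lists of PFNs.
    """
    hits = Counter()
    for accessed in scans:
        hits.update(accessed)  # each PFN counts once per interval

    # Pages in a slow tier that were accessed in many intervals.
    promote = sorted(
        pfn for pfn, tier in page_tier.items()
        if tier > 0 and hits[pfn] >= hot_threshold
    )
    # Pages in the fast tier that were rarely or never accessed.
    demote = sorted(
        pfn for pfn, tier in page_tier.items()
        if tier == 0 and hits[pfn] <= cold_threshold
    )
    return promote, demote
```

A real policy would also weigh application priority and per-tier
limits; the point here is only that the agent needs PFN-granular access
data covering both mapped and unmapped pages.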

The userspace agent can also make use of hardware PMU events, for
which the existing kernel support should be sufficient.

The third area is the API support for migrating pages.  The existing
move_pages() syscall can be a candidate, though it is virtual-address
based and cannot migrate unmapped pages.  Is a physical-address based
variant (e.g. move_pfns()) an acceptable proposal?
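
For reference, the sketch below shows how the existing move_pages()
syscall is driven from userspace today, which makes the limitation
visible: the caller must supply a pid and virtual addresses in that
process, so a page with no mapping cannot be named.  It invokes the
raw syscall through ctypes; the syscall number is hardcoded for x86_64
only, and the helper names are hypothetical.

```python
import ctypes
import os

SYS_MOVE_PAGES = 279   # x86_64 only (assumption); differs on other arches
MPOL_MF_MOVE = 1 << 1  # move pages owned exclusively by the target process


def build_request(addresses, node):
    """Build the three parallel arrays that move_pages() expects."""
    count = len(addresses)
    pages = (ctypes.c_void_p * count)(*addresses)      # virtual addresses
    nodes = (ctypes.c_int * count)(*([node] * count))  # target node per page
    status = (ctypes.c_int * count)()                  # filled in by the kernel
    return pages, nodes, status


def move_to_node(pid, addresses, node):
    """Ask the kernel to migrate the given pages of `pid` to `node`.

    `addresses` are page-aligned virtual addresses in the target
    process; pid 0 means the calling process.  Returns the per-page
    status list: a node ID on success or a negative errno value.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    pages, nodes, status = build_request(addresses, node)
    rc = libc.syscall(
        ctypes.c_long(SYS_MOVE_PAGES), ctypes.c_int(pid),
        ctypes.c_ulong(len(addresses)), pages, nodes, status,
        ctypes.c_int(MPOL_MF_MOVE),
    )
    if rc < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return list(status)
```

A physical-address based variant would presumably take PFNs in place of
the (pid, virtual address) pairs, but its exact shape is an open
question.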

[1] https://lore.kernel.org/lkml/9cd0dcde-f257-1b94-17d0-f2e24a3ce979@xxxxxxxxx/
[2] https://lore.kernel.org/patchwork/cover/1408180/

Thanks,
Wei