[LSF/MM TOPIC] Tiered memory accounting and management

Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx> · Mon, 14 Jun 2021 14:51:04 -0700

From: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>

Tiered memory accounting and management
------------------------------------------------------------
Traditionally, all RAM has been DRAM.  Some DRAM might be closer/faster
than other DRAM, but a byte of media has about the same cost whether it
is close or far.  With new memory tiers such as High-Bandwidth
Memory or Persistent Memory, there is now a choice between fast/expensive
and slow/cheap.  The current memory cgroups, however, still live in the
old model: there is only one set of limits, which implies that all
memory has the same cost.  We would like to extend memory cgroups to
comprehend different memory tiers and give users a way to choose a mix
of fast/expensive and slow/cheap memory.

To manage such memory, we will need to account memory usage and
impose limits for each kind of memory.

A couple of approaches to partitioning the memory between cgroups
have been discussed previously; they are listed below.  We would like to
use the LSF/MM session to come to a consensus on the approach to
take.

1.	Per NUMA node limit and accounting for each cgroup.
We can assign higher limits on better performing memory nodes to
higher priority cgroups.

There are some loose ends here that warrant further discussion:
(1) A user friendly interface for such limits.  Would a proportional
weight for the cgroup that translates to an actual absolute limit be
more suitable?
(2) Memory misconfigurations can occur more easily, as the admin
has a much larger number of limits spread among the
cgroups to manage.  Over-restrictive limits can lead to underutilized
and wasted memory and hurt performance.
(3) OOM behavior when a cgroup hits its limit.
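
As a rough illustration of the proportional weight idea in (1), the
sketch below splits each node's capacity among cgroups according to
per-node weights.  Everything in it is hypothetical: the function name,
the data layout, the node capacities and the weights are made up for
this example and do not correspond to any existing cgroup interface.

```python
GiB = 1 << 30

def weights_to_limits(node_capacity, cgroup_weights):
    """Turn per-node proportional weights into absolute byte limits.

    node_capacity:  {node_id: capacity_in_bytes}
    cgroup_weights: {cgroup_name: {node_id: weight}}
    Returns {cgroup_name: {node_id: limit_in_bytes}}.
    """
    limits = {cg: {} for cg in cgroup_weights}
    for node, capacity in node_capacity.items():
        # Sum of all weights that the cgroups declare for this node.
        total = sum(w.get(node, 0) for w in cgroup_weights.values())
        for cg, weights in cgroup_weights.items():
            weight = weights.get(node, 0)
            # A node that nobody has a weight on yields a limit of 0.
            limits[cg][node] = capacity * weight // total if total else 0
    return limits

# Assumed example system: node 0 is DRAM, node 1 is persistent memory.
node_capacity = {0: 64 * GiB, 1: 256 * GiB}
cgroup_weights = {
    "high": {0: 3, 1: 1},   # favors the fast node
    "low":  {0: 1, 1: 3},   # favors the slow node
}
limits = weights_to_limits(node_capacity, cgroup_weights)
```

With these assumed numbers the admin sets two weights per cgroup, and
the absolute limits follow from the node sizes, so the limits on a node
always add up to at most its capacity.  This avoids one class of the
misconfigurations mentioned in (2), but does not address the OOM
question in (3).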

2.	Per memory tier limit and accounting for each cgroup.
We can assign higher limits on memory in a better performing
memory tier to higher priority cgroups.  I previously
prototyped a soft limit based implementation to demonstrate the
tiered limit idea.

There are also a number of issues here:
(1) The advantage is that we have fewer limits to deal with, which
simplifies configuration.  However, a number of people have raised
doubts about whether we can really classify the NUMA
nodes into memory tiers properly.  There could still be significant
performance differences between NUMA nodes even for the same kind of
memory.  We would also not have the fine-grained control and flexibility
that come with a per NUMA node limit.
(2) Would a memory hierarchy defined by the promotion/demotion
relationships between memory nodes be a viable approach for defining
memory tiers?
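
To make question (2) more concrete, here is a minimal sketch of one way
tiers could be derived from demotion relationships and then used for
per-tier accounting.  It assumes each node has at most one demotion
target and that every target also appears as a node in the table; the
function names, the node topology and the usage numbers are invented
for this example and are not taken from the prototype mentioned above.

```python
def tiers_from_demotion(demotion_target):
    """Assign a tier to every node; tier 0 is the fastest.

    demotion_target: {node_id: target_node_id or None}
    A node that is nobody's demotion target stays in tier 0.  A
    demotion target sits at least one tier below each of its sources.
    """
    tier = {node: 0 for node in demotion_target}
    # Relax until stable.  An acyclic graph settles within len() passes.
    for _ in range(len(demotion_target) + 1):
        changed = False
        for src, dst in demotion_target.items():
            if dst is not None and tier[dst] < tier[src] + 1:
                tier[dst] = tier[src] + 1
                changed = True
        if not changed:
            return tier
    raise ValueError("demotion graph contains a cycle")

def per_tier_usage(node_usage, tier):
    """Sum a cgroup's per-node usage into per-tier usage."""
    usage = {}
    for node, used in node_usage.items():
        usage[tier[node]] = usage.get(tier[node], 0) + used
    return usage

# Assumed topology: DRAM nodes 0 and 1 demote to PMEM nodes 2 and 3.
demotion_target = {0: 2, 1: 3, 2: None, 3: None}
tier = tiers_from_demotion(demotion_target)

# Hypothetical per-node usage of one cgroup, in GiB.
node_usage = {0: 10, 1: 5, 2: 40, 3: 0}
tier_usage = per_tier_usage(node_usage, tier)
```

In this sketch nodes 0 and 1 land in the same tier purely because of
their position in the demotion graph, which is exactly the concern in
(1): two nodes in one tier may still differ noticeably in performance.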

These issues related to the management of systems with multiple kinds of
memory can be ironed out in this session.