Re: [LSF/MM TOPIC] Tiered memory accounting and management

Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx> · Fri, 18 Jun 2021 15:11:44 -0700

On 6/17/21 11:48 AM, Shakeel Butt wrote:
> Thanks Yang for the CC.
> 
> On Tue, Jun 15, 2021 at 5:17 PM Yang Shi <shy828301@xxxxxxxxx> wrote:
>>
>> On Mon, Jun 14, 2021 at 2:51 PM Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx> wrote:
>>>
>>>
>>> From: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
>>>
>>> Tiered memory accounting and management
>>> ------------------------------------------------------------
>>> Traditionally, all RAM is DRAM.  Some DRAM might be closer/faster
>>> than others, but a byte of media has about the same cost whether it
>>> is close or far.  But, with new memory tiers such as High-Bandwidth
>>> Memory or Persistent Memory, there is a choice between fast/expensive
>>> and slow/cheap.  But, the current memory cgroups still live in the
>>> old model. There is only one set of limits, and it implies that all
>>> memory has the same cost.  We would like to extend memory cgroups to
>>> comprehend different memory tiers to give users a way to choose a mix
>>> between fast/expensive and slow/cheap.
>>>
>>> To manage such memory, we will need to account memory usage and
>>> impose limits for each kind of memory.
>>>
>>> A couple of approaches to partitioning memory between cgroups have
>>> been discussed previously; they are listed below.  We would like to
>>> use the LSF/MM session to come to a consensus on the approach to
>>> take.
>>>
>>> 1.      Per NUMA node limit and accounting for each cgroup.
>>> We can assign higher limits on better performing memory nodes for higher priority cgroups.
>>>
>>> There are some loose ends here that warrant further discussions:
>>> (1) A user friendly interface for such limits.  Would a proportional
>>> weight for the cgroup that translates to an actual absolute limit be more suitable?
>>> (2) Memory mis-configurations can occur more easily as the admin
>>> has a much larger number of limits spread among the
>>> cgroups to manage.  Over-restrictive limits can lead to underutilized
>>> and wasted memory and hurt performance.
>>> (3) OOM behavior when a cgroup hits its limit.
>>>
> 
> This (numa based limits) is something I was pushing for but after
> discussing this internally with userspace controller devs, I have to
> back off from this position.
> 
> The main feedback I got was that setting one memory limit is already
> complicated and having to set/adjust this many limits would be
> horrifying.
> 
>>> 2.      Per memory tier limit and accounting for each cgroup.
>>> We can assign higher limits on memories in better performing
>>> memory tier for higher priority cgroups.  I previously
>>> prototyped a soft limit based implementation to demonstrate the
>>> tiered limit idea.
>>>
>>> There are also a number of issues here:
>>> (1)     The advantage is we have fewer limits to deal with, simplifying
>>> configuration. However, there are doubts raised by a number
>>> of people on whether we can really properly classify the NUMA
>>> nodes into memory tiers. There could still be significant performance
>>> differences between NUMA nodes even for the same kind of memory.
>>> We will also not have the fine-grained control and flexibility that comes
>>> with a per NUMA node limit.
>>> (2)     Will a memory hierarchy defined by promotion/demotion relationship between
>>> memory nodes be a viable approach for defining memory tiers?
>>>
>>> These issues related to the management of systems with multiple kinds of memory
>>> can be ironed out in this session.
>>
>> Thanks for suggesting this topic. I'm interested in the topic and
>> would like to attend.
>>
>> Other than the above points, I'm wondering whether we should discuss
>> "Migrate Pages in lieu of discard" as well? Dave Hansen is driving the
>> development and I have been involved in the early development and
>> review, but it seems there are still some open questions according to
>> the latest review feedback.
>>
>> Some other folks may be interested in this topic as well, so I CC'ed them in
>> the thread.
>>
> 
> At the moment "personally" I am more inclined towards a passive
> approach towards the memcg accounting of memory tiers. By that I mean,
> let's start by providing a 'usage' interface and get more
> production/real-world data to motivate the 'limit' interfaces. (One
> minor reason is that defining the 'limit' interface will force us to
> make the decision on defining tiers i.e. numa or a set of numa or
> others).

Probably we could first start with accounting the memory used in each
NUMA node for a cgroup and exposing this information to user space.  
I think that is useful regardless.
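
As a rough illustration of what user space could do with such per-node
accounting, here is a minimal sketch.  It assumes the numbers are exposed
in the style of cgroup v2's memory.numa_stat, i.e. lines of the form
"anon N0=<bytes> N1=<bytes>".  The sample text, the node-to-tier mapping
and both function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical sample in the memory.numa_stat style (values in bytes).
SAMPLE = """\
anon N0=1048576 N1=4194304
file N0=2097152 N1=0
"""


def parse_numa_stat(text):
    """Return {node_id: {stat_name: bytes}} from numa_stat-style text."""
    per_node = defaultdict(dict)
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        name, pairs = fields[0], fields[1:]
        for pair in pairs:
            node, _, value = pair.partition("=")
            if not node.startswith("N"):
                continue  # skip anything that is not an N<id>=<value> pair
            per_node[int(node[1:])][name] = int(value)
    return dict(per_node)


def usage_by_tier(per_node, node_to_tier, stats=("anon", "file")):
    """Sum the chosen stats per tier; node_to_tier is an assumed mapping."""
    totals = defaultdict(int)
    for node, values in per_node.items():
        tier = node_to_tier.get(node, "unknown")
        totals[tier] += sum(values.get(s, 0) for s in stats)
    return dict(totals)


if __name__ == "__main__":
    nodes = parse_numa_stat(SAMPLE)
    # Assumption for the example: node 0 is DRAM, node 1 is the slow tier.
    print(usage_by_tier(nodes, {0: "fast", 1: "slow"}))
```

Note that the tier view here is derived entirely in user space from the
per-node numbers, so it does not require the kernel to commit to a
definition of tiers.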

There is still a question of whether we want to define a tier as a set
of NUMA nodes and extend the accounting and management to that
memory tier abstraction level.

> 
> IMHO we should focus more on the "aging" of the application memory and
> "migration/balance" between the tiers. I don't think the memory
> reclaim infrastructure is the right place for these operations
> (unevictable pages are ignored and ages are not accurate). What we need is
> proactive continuous aging and balancing. We need something like, with
> additions, Multi-gen LRUs or DAMON or page idle tracking for aging and
> a new mechanism for balancing which takes ages into account.

Multi-gen LRUs will be pretty useful to expose the page warmth in a NUMA
node and to target the right pages to reclaim for a memcg. We will also need some
way to determine how many pages to target in each memcg for reclaim.
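
One possible way to pick per-memcg targets, purely as a sketch: split the
reclaim budget for a node (or tier) in proportion to how far each memcg
is above its soft limit there.  This is an illustrative policy, not
something that exists in the kernel, and the function name and the
dictionaries are hypothetical.

```python
def split_reclaim_target(usage, soft_limit, pages_to_reclaim):
    """Split pages_to_reclaim among memcgs in proportion to their excess.

    usage and soft_limit map memcg name -> pages on the node or tier
    under pressure.  A memcg at or below its soft limit gets no target.
    """
    excess = {cg: max(0, usage[cg] - soft_limit.get(cg, 0)) for cg in usage}
    total = sum(excess.values())
    if total == 0 or pages_to_reclaim <= 0:
        return {cg: 0 for cg in usage}

    # Never ask for more than the combined excess.
    budget = min(pages_to_reclaim, total)
    targets = {cg: budget * e // total for cg, e in excess.items()}

    # Hand out pages lost to integer rounding, largest excess first.
    leftover = budget - sum(targets.values())
    for cg in sorted(excess, key=excess.get, reverse=True):
        if leftover == 0:
            break
        if targets[cg] < excess[cg]:
            targets[cg] += 1
            leftover -= 1
    return targets


if __name__ == "__main__":
    usage = {"a": 1000, "b": 500, "c": 200}
    soft = {"a": 400, "b": 300, "c": 300}
    print(split_reclaim_target(usage, soft, 400))
```

Within each memcg, the page warmth information would then decide which
pages make up its share.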

> 
> To give a more concrete example: Let's say we have a system with two
> memory tiers and multiple low and high priority jobs. For high
> priority jobs, set the allocation try list from high to low tier and
> for low priority jobs the reverse of that (I am not sure if we can do
> that out of the box with today's kernel). In the background we migrate
> cold memory down the tiers and hot memory in the reverse direction.
> 
> In this background mechanism we can enforce all different limiting
> policies like Yang's original high and low tier percentage or
> something like X% of accesses of high priority jobs should be from
> high tier. 

If I understand correctly, you want the kernel to provide
an interface that exposes performance information like
"X% of accesses of high priority jobs are from the high tier",
and knobs for user space to tell the kernel to re-balance pages on
a per job class (or cgroup) basis based on this information.
The page re-balancing will be initiated by user space rather than
by the kernel, similar to what Wei proposed.
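
To make that concrete, the user space side could look roughly like the
sketch below.  It assumes per-cgroup access counts for each tier are
available from some sampling source; no such interface is defined in
this thread, so the input format and the function names are
hypothetical.

```python
def high_tier_share(accesses):
    """Fraction of sampled accesses served from the high tier, or None."""
    total = accesses["high"] + accesses["low"]
    if total == 0:
        return None  # no samples, nothing to conclude
    return accesses["high"] / total


def cgroups_below_target(samples, target_share):
    """Names of cgroups whose high-tier share is below target_share."""
    below = []
    for cg, accesses in samples.items():
        share = high_tier_share(accesses)
        if share is not None and share < target_share:
            below.append(cg)
    return sorted(below)


if __name__ == "__main__":
    # Hypothetical sampled access counts per cgroup and tier.
    samples = {
        "web": {"high": 900, "low": 100},
        "db": {"high": 600, "low": 400},
        "idle": {"high": 0, "low": 0},
    }
    # Cgroups returned here are the ones user space would ask the
    # kernel to re-balance toward the high tier.
    print(cgroups_below_target(samples, 0.8))
```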

> Basically I am saying until we find from production data
> that this background mechanism is not strong enough to enforce passive
> limits, we should delay the decision on limit interfaces.
>

Implementing hard limits on a per node basis does have a number of
rough edges.  Probably we should first start with doing the
proper accounting and exposing the right performance information.

Tim