On 6/17/21 11:48 AM, Shakeel Butt wrote:
> Thanks Yang for the CC.
>
> On Tue, Jun 15, 2021 at 5:17 PM Yang Shi <shy828301@xxxxxxxxx> wrote:
>>
>> On Mon, Jun 14, 2021 at 2:51 PM Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx> wrote:
>>>
>>>
>>> From: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
>>>
>>> Tiered memory accounting and management
>>> ------------------------------------------------------------
>>> Traditionally, all RAM is DRAM. Some DRAM might be closer/faster
>>> than others, but a byte of media has about the same cost whether it
>>> is close or far. But, with new memory tiers such as High-Bandwidth
>>> Memory or Persistent Memory, there is a choice between fast/expensive
>>> and slow/cheap. But, the current memory cgroups still live in the
>>> old model. There is only one set of limits, and it implies that all
>>> memory has the same cost. We would like to extend memory cgroups to
>>> comprehend different memory tiers to give users a way to choose a mix
>>> between fast/expensive and slow/cheap.
>>>
>>> To manage such memory, we will need to account memory usage and
>>> impose limits for each kind of memory.
>>>
>>> A couple of approaches to partitioning the memory between the
>>> cgroups have been discussed previously; they are listed below.
>>> We would like to use the LSF/MM session to come to a consensus
>>> on the approach to take.
>>>
>>> 1. Per NUMA node limit and accounting for each cgroup.
>>>    We can assign higher limits on better performing memory nodes
>>>    for higher priority cgroups.
>>>
>>>    There are some loose ends here that warrant further discussion:
>>>    (1) A user friendly interface for such limits. Will a proportional
>>>        weight for the cgroup that translates to an actual absolute
>>>        limit be more suitable?
>>>    (2) Memory mis-configurations can occur more easily as the admin
>>>        has a much larger number of limits spread among the
>>>        cgroups to manage. Over-restrictive limits can lead to
>>>        under-utilized and wasted memory and hurt performance.
>>>    (3) OOM behavior when a cgroup hits its limit.
>>>
>
> This (numa based limits) is something I was pushing for but after
> discussing this internally with userspace controller devs, I have to
> back off from this position.
>
> The main feedback I got was that setting one memory limit is already
> complicated and having to set/adjust this many limits would be
> horrifying.
>
>>> 2. Per memory tier limit and accounting for each cgroup.
>>>    We can assign higher limits on memory in a better performing
>>>    memory tier for higher priority cgroups. I previously
>>>    prototyped a soft limit based implementation to demonstrate the
>>>    tiered limit idea.
>>>
>>>    There are also a number of issues here:
>>>    (1) The advantage is that we have fewer limits to deal with,
>>>        which simplifies configuration. However, there are doubts
>>>        raised by a number of people on whether we can really
>>>        properly classify the NUMA nodes into memory tiers. There
>>>        could still be significant performance differences between
>>>        NUMA nodes even for the same kind of memory. We will also
>>>        not have the fine-grained control and flexibility that comes
>>>        with a per NUMA node limit.
>>>    (2) Will a memory hierarchy defined by the promotion/demotion
>>>        relationship between memory nodes be a viable approach for
>>>        defining memory tiers?
>>>
>>> These issues related to the management of systems with multiple
>>> kinds of memory can be ironed out in this session.
>>
>> Thanks for suggesting this topic. I'm interested in the topic and
>> would like to attend.
>>
>> Other than the above points, I'm wondering whether we shall discuss
>> "Migrate Pages in lieu of discard" as well? Dave Hansen is driving the
>> development and I have been involved in the early development and
>> review, but it seems there are still some open questions according to
>> the latest review feedback.
>>
>> Some other folks may be interested in this topic as well, CC'ed them in
>> the thread.
>>
>
> At the moment "personally" I am more inclined towards a passive
> approach towards the memcg accounting of memory tiers. By that I mean,
> let's start by providing a 'usage' interface and get more
> production/real-world data to motivate the 'limit' interfaces. (One
> minor reason is that defining the 'limit' interface will force us to
> make the decision on defining tiers i.e. numa or a set of numa or
> others).

Probably we could start by accounting the memory used in each
NUMA node for a cgroup and exposing this information to user space.
I think that is useful regardless.

There is still the question of whether we want to define a set of NUMA
nodes as a tier and extend the accounting and management to that memory
tier abstraction level.

>
> IMHO we should focus more on the "aging" of the application memory and
> "migration/balance" between the tiers. I don't think the memory
> reclaim infrastructure is the right place for these operations
> (unevictable pages are ignored and not accurate ages). What we need is
> proactive continuous aging and balancing. We need something like, with
> additions, Multi-gen LRUs or DAMON or page idle tracking for aging and
> a new mechanism for balancing which takes ages into account.

Multi-gen LRUs will be pretty useful for exposing the page warmth in a
NUMA node and for targeting the right pages to reclaim for a memcg. We
will also need some way to determine how many pages to target in each
memcg for a reclaim.

>
> To give a more concrete example: Let's say we have a system with two
> memory tiers and multiple low and high priority jobs. For high
> priority jobs, set the allocation try list from high to low tier and
> for low priority jobs the reverse of that (I am not sure if we can do
> that out of the box with today's kernel). In the background we migrate
> cold memory down the tiers and hot memory in the reverse direction.
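
To make the example above concrete, here is a toy user-space model of it.
This is only a sketch under stated assumptions, not a proposal for how the
kernel would implement it: the two tiers, the per-page "age" counter, the
cold_age threshold and every name below (Page, Tier, alloc_order,
allocate, rebalance) are made up for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical tier indices: 0 = fast/expensive, 1 = slow/cheap.
FAST, SLOW = 0, 1


@dataclass(eq=False)  # compare pages by identity, not by field values
class Page:
    job: str
    tier: int
    age: int = 0  # scans since last access; larger means colder


@dataclass
class Tier:
    capacity: int  # number of pages the tier can hold
    pages: list = field(default_factory=list)

    def has_room(self):
        return len(self.pages) < self.capacity


def alloc_order(priority):
    """Allocation try list: high priority tries the fast tier first,
    low priority tries the slow tier first."""
    return [FAST, SLOW] if priority == "high" else [SLOW, FAST]


def allocate(tiers, job, priority):
    """Place a new page in the first tier of the try list that has room."""
    for t in alloc_order(priority):
        if tiers[t].has_room():
            page = Page(job=job, tier=t)
            tiers[t].pages.append(page)
            return page
    return None  # both tiers full; a real system would have to reclaim


def _move(page, src, dst, dst_index):
    src.pages.remove(page)
    dst.pages.append(page)
    page.tier = dst_index


def rebalance(tiers, cold_age):
    """One background pass: demote cold pages, then promote hot ones.

    Simplification: pages only move into free room; no page swapping.
    """
    fast, slow = tiers[FAST], tiers[SLOW]
    # Demote the coldest fast-tier pages first.
    for page in sorted(fast.pages, key=lambda p: -p.age):
        if page.age < cold_age or not slow.has_room():
            break
        _move(page, fast, slow, SLOW)
    # Promote the hottest slow-tier pages into whatever room is left.
    for page in sorted(slow.pages, key=lambda p: p.age):
        if page.age >= cold_age or not fast.has_room():
            break
        _move(page, slow, fast, FAST)
```

For instance, with a fast tier of 2 pages and a slow tier of 4, two pages
allocated by a high priority job start in the fast tier and two pages
from a low priority job start in the slow tier. If one of the high
priority pages then goes cold, a rebalance() pass demotes it and uses the
freed room to promote a hot page of the low priority job.
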
>
> In this background mechanism we can enforce all different limiting
> policies like Yang's original high and low tier percentage or
> something like X% of accesses of high priority jobs should be from
> high tier.

If I understand correctly, you want the kernel to provide an interface
that exposes performance information like "X% of accesses of high
priority jobs is from high tier", plus knobs for user space to tell the
kernel to re-balance pages on a per job class (or cgroup) basis based on
this information. The page re-balancing would be initiated by user space
rather than by the kernel, similar to what Wei proposed.

> Basically I am saying until we find from production data
> that this background mechanism is not strong enough to enforce passive
> limits, we should delay the decision on limit interfaces.
>

Implementing hard limits on a per node basis does have a number of rough
edges. Probably we should first start with doing the proper accounting
and exposing the right performance information.

Tim