On Fri, Jun 18, 2021 at 3:11 PM Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx> wrote:
>
> On 6/17/21 11:48 AM, Shakeel Butt wrote:
[...]
> >
> > At the moment "personally" I am more inclined towards a passive
> > approach towards the memcg accounting of memory tiers. By that I mean,
> > let's start by providing a 'usage' interface and get more
> > production/real-world data to motivate the 'limit' interfaces. (One
> > minor reason is that defining the 'limit' interface will force us to
> > make the decision on defining tiers i.e. numa or a set of numa or
> > others).
>
> Probably we could first start with accounting the memory used in each
> NUMA node for a cgroup and exposing this information to user space.
> I think that is useful regardless.
>

Is memory.numa_stat not good enough? This interface does miss
__GFP_ACCOUNT non-slab allocations, percpu and sock.

> There is still a question of whether we want to define a set of
> numa node or tier and extend the accounting and management at that
> memory tier abstraction level.
>
[...]
> >
> > To give a more concrete example: Let's say we have a system with two
> > memory tiers and multiple low and high priority jobs. For high
> > priority jobs, set the allocation try list from high to low tier and
> > for low priority jobs the reverse of that (I am not sure if we can do
> > that out of the box with today's kernel). In the background we migrate
> > cold memory down the tiers and hot memory in the reverse direction.
> >
> > In this background mechanism we can enforce all different limiting
> > policies like Yang's original high and low tier percentage or
> > something like X% of accesses of high priority jobs should be from
> > high tier.
>
> If I understand what you are saying is you desire the kernel to provide
> the interface to expose performance information like
> "X% of accesses of high priority jobs is from high tier",

I think we can estimate "X% of accesses to high tier" using existing
perf/PMU counters. So, no new interface.

> and knobs for user space to tell kernel to re-balance pages on
> a per job class (or cgroup) basis based on this information.
> The page re-balancing will be initiated by user space rather than
> by the kernel, similar to what Wei proposed.

This is more open to discussion and we should brainstorm the pros and
cons of all proposed approaches.
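
To illustrate the memory.numa_stat point above, here is a rough user-space
sketch of per-tier 'usage' derived from that file. It assumes the cgroup v2
format, where each line is a stat name followed by N<node>=<bytes> pairs.
The tier-to-node mapping and the function names are hypothetical, since how
tiers get defined is exactly what is undecided. As noted, the result would
still miss __GFP_ACCOUNT non-slab allocations, percpu and sock.

```python
from collections import defaultdict


def parse_numa_stat(text):
    """Parse memory.numa_stat text into {node_id: {stat_name: bytes}}.

    Assumes the cgroup v2 layout: "<stat> N0=<bytes> N1=<bytes> ...".
    """
    per_node = defaultdict(dict)
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        stat, pairs = fields[0], fields[1:]
        for pair in pairs:
            node, _, value = pair.partition("=")
            # Skip anything that is not an N<id>=<bytes> pair.
            if not (node.startswith("N") and node[1:].isdigit() and value.isdigit()):
                continue
            per_node[int(node[1:])][stat] = int(value)
    return dict(per_node)


def tier_usage(per_node, tiers, stats=("anon", "file")):
    """Sum the selected stats per tier.

    `tiers` maps a tier name to a list of node ids. This mapping is a
    hypothetical, user-supplied policy; the kernel defines no such thing.
    """
    return {
        name: sum(per_node.get(n, {}).get(s, 0) for n in nodes for s in stats)
        for name, nodes in tiers.items()
    }
```

Usage would be to read memory.numa_stat from the cgroup's directory, pass
the contents to parse_numa_stat(), and then call tier_usage() with whatever
node grouping is being evaluated, e.g. {"high": [0], "low": [1]}.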