Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx> · Wed, 14 Apr 2021 17:42:39 -0700

On 4/12/21 12:20 PM, Shakeel Butt wrote:

>>
>> memory_t0.current       Current usage of tier 0 memory by the cgroup.
>>
>> memory_t0.min           If tier 0 memory used by the cgroup falls below this low
>>                         boundary, the memory will not be subjected to demotion
>>                         to lower tiers to free up memory at tier 0.
>>
>> memory_t0.low           Above this boundary, the tier 0 memory will be subjected
>>                         to demotion.  The demotion pressure will be proportional
>>                         to the overage.
>>
>> memory_t0.high          If tier 0 memory used by the cgroup exceeds this high
>>                         boundary, allocation of tier 0 memory by the cgroup will
>>                         be throttled. The tier 0 memory used by this cgroup
>>                         will also be subjected to heavy demotion.
>>
>> memory_t0.max           This will be a hard usage limit of tier 0 memory on the cgroup.
>>
>> If needed, memory_t[12...].current/min/low/high for additional tiers can be added.
>> This follows closely with the design of the general memory controller interface.
>>
>> Will such an interface looks sane and acceptable with everyone?
>>
> 
> I have a couple of questions. Let's suppose we have a two socket
> system. Node 0 (DRAM+CPUs), Node 1 (DRAM+CPUs), Node 2 (PMEM on socket
> 0 along with Node 0) and Node 3 (PMEM on socket 1 along with Node 1).
> Based on the tier definition of this patch series, tier_0: {node_0,
> node_1} and tier_1: {node_2, node_3}.
> 
> My questions are:
> 
> 1) Can we assume that the cost of access within a tier will always be
> less than the cost of access from the tier? (node_0 <-> node_1 vs
> node_0 <-> node_2)

I do assume that higher tier memory offers better performance (or less
access latency) than a lower tier memory.  Otherwise, this defeats the
whole purpose of promoting hot memory from lower tier to a higher tier,
and demoting cold memory to a lower tier.

Tiers assumption is embedded once we define this promotion/demotion relationship
between the numa nodes.

So if 

  node_m ----demotes----> node_n
         <---promotes---- 

then node_m is one tier higher tier than node_n. This promotion/demotion
relationship between the nodes is the underpinning of Dave and Ying's
demotion and promotion patch sets.  

> 2) If yes to (1), is that assumption future proof? Will the future
> systems with DRAM over CXL support have the same characteristics?

I think if you configure a promotion/demotion relationship between
DRAM over CXL and local-socket connected DRAM, you could divide them
up into separate tiers.  Or you don't care about the difference and
you will configure them not to have a promotion/demotion relationship
and they will be at the same tier.  Balance within the same tier
will be effected by the autonuma mechanism.

> 3) Will the cost of access from tier_0 to tier_1 be uniform? (node_0
> <-> node_2 vs node_0 <-> node_3). For jobs running on node_0, node_3
> might be third tier and similarly for jobs running on node_1, node_2
> might be third tier.

Tier definition is an admin's choice, of where the admin think the
hot memory should reside after looking at the memory performance.
It falls out of how the admin construct the promotion/demotion relationship
between the nodes and OS does not assume the tier relationship from
memory performance directly. 

> 
> The reason I am asking these questions is that the statically
> partitioning memory nodes into tiers will inherently add platform
> specific assumptions in the user API.
> 
> Assumptions like:
> 1) Access within tier is always cheaper than across tier.
> 2) Access from tier_i to tier_i+1 has uniform cost.
> 
> The reason I am more inclined towards having numa centric control is
> that we don't have to make these assumptions. Though the usability
> will be more difficult. Greg (CCed) has some ideas on making it better
> and we will share our proposal after polishing it a bit more.
> 

I am still trying to understand how a numa centric control actually
work. Putting limits on every numa node for each cgroup
seems to make the system configuration quite complicated.  Looking
forward to your proposal so I can better understand that perspective.

Tim