On 10/28/22 8:33 AM, Huang, Ying wrote:
> Hi, Aneesh,
>
> Aneesh Kumar K V <aneesh.kumar@xxxxxxxxxxxxx> writes:
>
>> On 10/27/22 12:29 PM, Huang Ying wrote:
>>> We need some way to override the system default memory tiers. For
>>> the example system as follows,
>>>
>>> type      abstract distance
>>> ----      -----------------
>>> HBM       300
>>> DRAM      1000
>>> CXL_MEM   5000
>>> PMEM      5100
>>>
>>> Given the memory tier chunk size is 100, the default memory tiers
>>> could be,
>>>
>>> tier   abstract distance range   types
>>> ----   -----------------------   -----
>>> 3      300-400                   HBM
>>> 10     1000-1100                 DRAM
>>> 50     5000-5100                 CXL_MEM
>>> 51     5100-5200                 PMEM
>>>
>>> If we want to group CXL_MEM and PMEM into one tier, we have 2 choices.
>>>
>>> 1) Override the abstract distance of CXL_MEM or PMEM. For example, if
>>> we change the abstract distance of PMEM to 5050, the memory tiers
>>> become,
>>>
>>> tier   abstract distance range   types
>>> ----   -----------------------   -----
>>> 3      300-400                   HBM
>>> 10     1000-1100                 DRAM
>>> 50     5000-5100                 CXL_MEM, PMEM
>>>
>>> 2) Override the memory tier chunk size. For example, if we change the
>>> memory tier chunk size to 200, the memory tiers become,
>>>
>>> tier   abstract distance range   types
>>> ----   -----------------------   -----
>>> 1      200-400                   HBM
>>> 5      1000-1200                 DRAM
>>> 25     5000-5200                 CXL_MEM, PMEM
>>>
>>> But after some thought, I think choice 2) may not be good. The
>>> problem is that even if 2 abstract distances are almost the same,
>>> they may be put in 2 tiers if they sit on different sides of a tier
>>> boundary. For example, suppose the abstract distance of CXL_MEM is
>>> 4990, while the abstract distance of PMEM is 5010. Although the
>>> difference between the abstract distances is only 20, CXL_MEM and
>>> PMEM will be put in different tiers if the tier chunk size is 50,
>>> 100, 200, 250, 500, .... This makes choice 2) hard to use; it may
>>> become tricky to find an appropriate tier chunk size that satisfies
>>> all requirements.
>>>
>>
>> Shouldn't we wait to gain experience with how we end up mapping
>> devices with different latencies and bandwidths before tuning these
>> values?
>
> I just want to discuss the overall design.
>
>>> So I suggest we abandon choice 2) and use choice 1) only. This
>>> makes the overall design and user space interface simpler and
>>> easier to use. The overall design of the abstract distance could be:
>>>
>>> 1. Use decimal for the abstract distance and its chunk size. This
>>> makes them more user friendly.
>>>
>>> 2. Make the tier chunk size as small as possible, for example, 10.
>>> By default this will put different memory types in one memory tier
>>> only if their performance is almost the same. And we will not
>>> provide an interface to override the chunk size.
>>>
>>
>> This could also mean we end up with lots of memory tiers with
>> relatively small performance differences between them. Again, it
>> depends on how HMAT attributes will be mapped to abstract distance.
>
> Per my understanding, there will not be many memory types in a
> system, so there will not be many memory tiers either. Most systems
> have only 2 or 3 memory tiers, for example, HBM, DRAM, CXL, etc.

So we don't need the chunk size to be 10, because we don't foresee
needing to group devices into that many tiers.

> Do you know of systems with many memory types? The basic idea is to
> put different memory types in different memory tiers by default. If
> users want to group them, they can do that by overriding the abstract
> distance of some memory type.

With a small chunk size, and depending on how we are going to derive
abstract distance, I am wondering whether we would end up with lots of
memory tiers with no real value. Hence my suggestion to wait on making
a change like this until we have code that maps HMAT/CDAT attributes
to abstract distance.

>>
>>> 3. Make the abstract distance of normal DRAM large enough.
>>> For example, 1000; then 100 tiers can be defined below DRAM, which
>>> is more than enough in practice.
>>
>> Why 100? Will we really have that many tiers below/faster than DRAM?
>> As of now I see only HBM below it.
>
> Yes, 100 is more than enough. We just want to avoid grouping
> different memory types by default.
>
> Best Regards,
> Huang, Ying