RE: [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system

"Du, Fan" <fan.du@xxxxxxxxx> · Fri, 26 Apr 2019 02:40:05 +0000

>-----Original Message-----
>From: Dan Williams [mailto:dan.j.williams@xxxxxxxxx]
>Sent: Thursday, April 25, 2019 11:43 PM
>To: Du, Fan <fan.du@xxxxxxxxx>
>Cc: Michal Hocko <mhocko@xxxxxxxxxx>; akpm@xxxxxxxxxxxxxxxxxxxx; Wu,
>Fengguang <fengguang.wu@xxxxxxxxx>; Hansen, Dave
><dave.hansen@xxxxxxxxx>; xishi.qiuxishi@xxxxxxxxxxxxxxx; Huang, Ying
><ying.huang@xxxxxxxxx>; linux-mm@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
>Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
>memory system
>
>On Thu, Apr 25, 2019 at 1:05 AM Du, Fan <fan.du@xxxxxxxxx> wrote:
>>
>>
>>
>> >-----Original Message-----
>> >From: owner-linux-mm@xxxxxxxxx [mailto:owner-linux-mm@xxxxxxxxx] On
>> >Behalf Of Michal Hocko
>> >Sent: Thursday, April 25, 2019 3:54 PM
>> >To: Du, Fan <fan.du@xxxxxxxxx>
>> >Cc: akpm@xxxxxxxxxxxxxxxxxxxx; Wu, Fengguang
><fengguang.wu@xxxxxxxxx>;
>> >Williams, Dan J <dan.j.williams@xxxxxxxxx>; Hansen, Dave
>> ><dave.hansen@xxxxxxxxx>; xishi.qiuxishi@xxxxxxxxxxxxxxx; Huang, Ying
>> ><ying.huang@xxxxxxxxx>; linux-mm@xxxxxxxxx;
>linux-kernel@xxxxxxxxxxxxxxx
>> >Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
>> >memory system
>> >
>> >On Thu 25-04-19 07:41:40, Du, Fan wrote:
>> >>
>> >>
>> >> >-----Original Message-----
>> >> >From: Michal Hocko [mailto:mhocko@xxxxxxxxxx]
>> >> >Sent: Thursday, April 25, 2019 2:37 PM
>> >> >To: Du, Fan <fan.du@xxxxxxxxx>
>> >> >Cc: akpm@xxxxxxxxxxxxxxxxxxxx; Wu, Fengguang
>> ><fengguang.wu@xxxxxxxxx>;
>> >> >Williams, Dan J <dan.j.williams@xxxxxxxxx>; Hansen, Dave
>> >> ><dave.hansen@xxxxxxxxx>; xishi.qiuxishi@xxxxxxxxxxxxxxx; Huang, Ying
>> >> ><ying.huang@xxxxxxxxx>; linux-mm@xxxxxxxxx;
>> >linux-kernel@xxxxxxxxxxxxxxx
>> >> >Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
>> >> >memory system
>> >> >
>> >> >On Thu 25-04-19 09:21:30, Fan Du wrote:
>> >> >[...]
>> >> >> However PMEM has different characteristics from DRAM,
>> >> >> the more reasonable or desirable fallback style would be:
>> >> >> DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3.
>> >> >> When DRAM is exhausted, try PMEM then.
>> >> >
>> >> >Why and who does care? NUMA is fundamentally about memory nodes with
>> >> >different access characteristics so why is PMEM any special?
>> >>
>> >> Michal, thanks for your comments!
>> >>
>> >> The "different" lies in the local or remote access, usually the underlying
>> >> memory is the same type, i.e. DRAM.
>> >>
>> >> By "special", PMEM is usually in gigantic capacity than DRAM per dimm,
>> >> while with different read/write access latency than DRAM.
>> >
>> >You are describing a NUMA in general here. Yes access to different NUMA
>> >nodes has a different read/write latency. But that doesn't make PMEM
>> >really special from a regular DRAM.
>>
>> Not the numa distance b/w cpu and PMEM node make PMEM different than
>> DRAM. The difference lies in the physical layer. The access latency characteristics
>> comes from media level.
>
>No, there is no such thing as a "PMEM node". I've pushed back on this
>broken concept in the past [1] [2]. Consider that PMEM could be as
>fast as DRAM for technologies like NVDIMM-N or in emulation
>environments. These attempts to look at persistence as an attribute of
>performance are entirely missing the point that the system can have
>multiple varied memory types and the platform firmware needs to
>enumerate these performance properties in the HMAT on ACPI platforms.
>Any scheme that only considers a binary DRAM and not-DRAM property is
>immediately invalidated the moment the OS needs to consider a 3rd or
>4th memory type, or a more varied connection topology.

Dan, thanks for your comments!

I have understood your point ever since your earlier posts.
Below is what I have in mind, speaking as a [standalone personal contributor] only:
a. I fully recognize what HMAT is designed for.
b. I understand your point that the "type" distinction is only temporary, and I think you are
  right about it.

A generic approach is indeed required. However, I want to elaborate on the problem
I'm trying to solve for customers, not on how we or other people solve it one way or another.

Customers require full utilization of system memory, whether it is DRAM, 1st generation PMEM,
or a future xth generation PMEM that beats DRAM.
Customers require explicit, [coarse grained] control over memory allocation for different
latency/bandwidth.

Maybe it's more worthwhile to think about what is essentially needed to solve the problem,
and to make sure it scales well enough for some period.

a. Build the fallback list for heterogeneous systems.
  I prefer to build it from HMAT, because HMAT exposes latency/bandwidth from the local node's
  perspective, and it's already standardized in the ACPI spec. NUMA node distance from SLIT is
  no longer accurate enough to help on heterogeneous memory systems.

b. Provide an explicit page allocation option for requests for frequently read pages.
  This requirement is well justified too. Any scenario, in kernel or user space, that doesn't care
  about write latency should use this option to achieve overall optimal performance.

c. NUMA balancing for heterogeneous systems.
  I'm aware of this topic, but it's outside what I have in mind right now (a. and b.).
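
To make points a. and b. concrete, here is a minimal sketch in Python (not kernel code).
It assumes a hypothetical per-node record, NodePerf, holding a SLIT-style distance and
HMAT-style read/write latencies as seen from CPU node 0. All numbers are made up for
illustration, and fallback_order() is a hypothetical helper, not an existing kernel function.
It only shows how the ranking key changes the fallback order.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class NodePerf:
    """Hypothetical view of one memory node from a single CPU node."""
    node_id: int
    slit_distance: int     # SLIT-style relative distance (made-up value)
    read_latency_ns: int   # HMAT-style read latency (made-up value)
    write_latency_ns: int  # HMAT-style write latency (made-up value)


def fallback_order(nodes: List[NodePerf],
                   cost: Callable[[NodePerf], int]) -> List[int]:
    """Return node ids from most to least preferred for the given cost."""
    # Ties fall back to ascending node id so the result is deterministic.
    ranked = sorted(nodes, key=lambda n: (cost(n), n.node_id))
    return [n.node_id for n in ranked]


# Two-socket example seen from CPU node 0 (illustrative numbers only).
nodes_from_cpu0 = [
    NodePerf(0, slit_distance=10, read_latency_ns=80, write_latency_ns=80),     # local DRAM
    NodePerf(1, slit_distance=21, read_latency_ns=140, write_latency_ns=140),   # remote DRAM
    NodePerf(2, slit_distance=17, read_latency_ns=300, write_latency_ns=1000),  # local PMEM
    NodePerf(3, slit_distance=28, read_latency_ns=400, write_latency_ns=1200),  # remote PMEM
]

# Ranking by distance alone puts local PMEM ahead of remote DRAM.
by_distance = fallback_order(nodes_from_cpu0, lambda n: n.slit_distance)

# Ranking by the worse of read/write latency exhausts DRAM first.
by_latency = fallback_order(
    nodes_from_cpu0, lambda n: max(n.read_latency_ns, n.write_latency_ns))

# Point b.: a read-mostly request could rank on read latency only.
by_read_latency = fallback_order(nodes_from_cpu0, lambda n: n.read_latency_ns)

print("distance:", by_distance)
print("latency: ", by_latency)
print("read:    ", by_read_latency)
```

With these assumed numbers the distance-based order is 0 -> 2 -> 1 -> 3, while the
latency-based order is 0 -> 1 -> 2 -> 3, i.e. DRAM is tried before PMEM. The cost
function is passed in, so a third or fourth memory type needs no special casing.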

>[1]: https://lore.kernel.org/lkml/CAPcyv4heiUbZvP7Ewoy-Hy=-mPrdjCjEuSw+0rwdOUHdjwetxg@xxxxxxxxxxxxxx/
>
>[2]: https://lore.kernel.org/lkml/CAPcyv4it1w7SdDVBV24cRCVHtLb3s1pVB5+SDM02Uw4RbahKiA@xxxxxxxxxxxxxx/