On Mon, Mar 25, 2019 at 9:15 AM Brice Goglin <Brice.Goglin@xxxxxxxx> wrote:
>
> On 23/03/2019 05:44, Yang Shi wrote:
> > With Dave Hansen's patches merged into Linus's tree
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
> >
> > PMEM can be hot-plugged as a NUMA node now. But how to use PMEM as a
> > NUMA node effectively and efficiently is still an open question.
> >
> > A couple of proposals have been posted to the mailing list [1] [2].
> >
> > This patchset tries an approach different from proposal [1] to use
> > PMEM as NUMA nodes.
> >
> > The approach is designed to follow these principles:
> >
> > 1. Use PMEM as a normal NUMA node: no special gfp flag, zone,
> > zonelist, etc.
> >
> > 2. DRAM first/by default. No surprise to existing applications and
> > default behavior. PMEM will not be allocated unless its node is
> > specified explicitly by NUMA policy. Some applications may not be
> > very sensitive to memory latency, so they could be placed on PMEM
> > nodes and then have hot pages promoted to DRAM gradually.
>
>
> I am not against the approach for some workloads. However, many HPC
> people would rather do this manually. But there's currently no easy way
> to find out from userspace whether a given NUMA node is DDR or PMEM*. We
> have to assume HMAT is available (and correct) and look at performance
> attributes. When talking to humans, it would be better to say "I
> allocated on the local DDR NUMA node" rather than "I allocated on the
> fastest node according to HMAT latency".
>
> Also, when we'll have HBM+DDR, some applications may want to use DDR by
> default, which means they want the *slowest* node according to HMAT (by
> the way, will your hybrid policy work if we ever have HBM+DDR+PMEM?).
> Performance attributes could help, but how does user-space know for sure
> that X>Y will still mean HBM>DDR and not DDR>PMEM in 5 years?
>
> It seems to me that exporting a flag in sysfs saying whether a node is
> PMEM could be convenient. Patch series [1] exported a "type" in sysfs
> node directories ("pmem" or "dram"). I don't know if there's an easy
> way to define what HBM is and expose that type too.

I'm generally against the concept that a "pmem" or "type" flag should
indicate anything about the expected performance of the address range.
The kernel should explicitly look to the HMAT for performance data and
not otherwise make type-based performance assumptions.
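
For illustration, the explicit-binding model described in point 2 above
can be exercised from userspace with libnuma. This is a minimal sketch,
not part of the patchset under discussion; the node number is an
assumption, since, as Brice notes, there is no portable way to tell from
userspace which node is PMEM:

    #include <numa.h>      /* libnuma; link with -lnuma */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            if (numa_available() < 0) {
                    fprintf(stderr, "NUMA not supported on this system\n");
                    return 1;
            }

            /*
             * Hypothetical topology: node 0 is local DDR, node 2 is a
             * hot-plugged PMEM node. The node number is an assumption
             * for illustration only; nothing exported by the kernel in
             * this discussion identifies which node is which.
             */
            int pmem_node = 2;
            size_t len = 64UL << 20;        /* 64 MiB */

            /* Default-style placement: allocate on the local (DRAM) node. */
            void *dram_buf = numa_alloc_local(len);

            /* Explicit policy: bind this allocation to the PMEM node. */
            void *pmem_buf = numa_alloc_onnode(len, pmem_node);

            if (!dram_buf || !pmem_buf) {
                    fprintf(stderr, "allocation failed\n");
                    return 1;
            }

            /* Touch the buffers so pages are actually faulted in. */
            memset(dram_buf, 0, len);
            memset(pmem_buf, 0, len);

            numa_free(dram_buf, len);
            numa_free(pmem_buf, len);
            return 0;
    }

Build with "gcc -o pmem-bind pmem-bind.c -lnuma". The same explicit
placement is available without code changes via numactl, e.g.
"numactl --membind=2 ./app", which matches the "no surprise by default,
opt in via NUMA policy" principle above.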