On 23/03/2019 at 05:44, Yang Shi wrote:
> With Dave Hansen's patches merged into Linus's tree
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
>
> PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
> effectively and efficiently is still a question.
>
> There have been a couple of proposals posted on the mailing list [1] [2].
>
> The patchset is aimed to try a different approach from this proposal [1]
> to use PMEM as NUMA nodes.
>
> The approach is designed to follow the below principles:
>
> 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
>
> 2. DRAM first/by default. No surprise to existing applications and default
> running. PMEM will not be allocated unless its node is specified explicitly
> by NUMA policy. Some applications may be not very sensitive to memory latency,
> so they could be placed on PMEM nodes then have hot pages promote to DRAM
> gradually.

I am not against the approach for some workloads. However, many HPC
people would rather do this manually. But there's currently no easy way
to find out from userspace whether a given NUMA node is DDR or PMEM*. We
have to assume HMAT is available (and correct) and look at performance
attributes. When talking to humans, it would be better to say "I
allocated on the local DDR NUMA node" rather than "I allocated on the
fastest node according to HMAT latency".

Also, once we have HBM+DDR, some applications may want to use DDR by
default, which means they want the *slowest* node according to HMAT (by
the way, will your hybrid policy work if we ever have HBM+DDR+PMEM?).
Performance attributes could help, but how does user-space know for sure
that X>Y will still mean HBM>DDR and not DDR>PMEM in 5 years?

It seems to me that exporting a flag in sysfs saying whether a node is
PMEM could be convenient. Patch series [1] exported a "type" in sysfs
node directories ("pmem" or "dram").
I don't know if there's an easy way to define what HBM is and expose
that type too.

Brice

* As far as I know, the only way is to look at all DAX devices until you
find the given NUMA node in the "target_node" attribute. If none
matches, the node is likely not PMEM-backed.

> [1]: https://lore.kernel.org/linux-mm/20181226131446.330864849@xxxxxxxxx/
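
For illustration, the workaround in the footnote (scanning DAX devices for a
matching "target_node") could look roughly like the Python sketch below. It
rests on assumptions that are not stated in this mail: that DAX devices are
visible under /sys/bus/dax/devices, that each one has a "target_node" file
holding a decimal node id, and that a negative value means "no node". The
function names are hypothetical helpers, not an existing API.

```python
import glob
import os

# Assumed location of DAX devices in sysfs; passed as a parameter so the
# scan can be pointed at another directory.
DEFAULT_DAX_ROOT = "/sys/bus/dax/devices"


def pmem_backed_nodes(dax_root=DEFAULT_DAX_ROOT):
    """Collect NUMA node ids named by the target_node of any DAX device."""
    nodes = set()
    for path in glob.glob(os.path.join(dax_root, "*", "target_node")):
        try:
            with open(path) as f:
                value = int(f.read().strip())
        except (OSError, ValueError):
            # Unreadable or malformed attribute: skip this device.
            continue
        if value >= 0:  # assumption: a negative id means "no target node"
            nodes.add(value)
    return nodes


def is_likely_pmem(node, dax_root=DEFAULT_DAX_ROOT):
    """True if some DAX device reports `node` as its target node."""
    return node in pmem_backed_nodes(dax_root)
```

As in the footnote, a negative answer only means "likely not PMEM-backed":
the check cannot tell DDR from HBM, and it finds nothing if the DAX
devices are not exposed in sysfs.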