On Wed 27-03-19 11:59:28, Yang Shi wrote:
>
>
> On 3/27/19 10:34 AM, Dan Williams wrote:
> > On Wed, Mar 27, 2019 at 2:01 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> > > On Tue 26-03-19 19:58:56, Yang Shi wrote:
[...]
> > > > It is still NUMA, users still can see all the NUMA nodes.
> > > No, the Linux NUMA implementation makes all NUMA nodes available by
> > > default and provides an API to opt in to finer tuning. What you are
> > > suggesting goes against that semantic, and I am asking why. How is a
> > > pmem NUMA node any different from any other distant node in principle?
> > Agree. It's just another NUMA node and shouldn't be special cased.
> > Userspace policy can choose to avoid it, but typical node distance
> > preference should otherwise let the kernel fall back to it as
> > additional memory pressure relief for "near" memory.
>
> In the ideal case, yes, I agree. However, in the real world performance
> is a concern. It is well known that PMEM (not considering NVDIMM-F or
> HBM) has higher latency and lower bandwidth than DRAM. We observed much
> higher latency on PMEM than on DRAM with multiple threads.

One rule of thumb is: do not design user-visible interfaces around the
up/down sides of contemporary technology. That will almost always fire
back.

Btw. you keep arguing about performance without giving any numbers. Can
you present something specific?

> In a real production environment we don't know which applications will
> end up on PMEM (DRAM may be full and allocations fall back to PMEM) and
> then see unexpected performance degradation. I understand that a
> mempolicy can be used to avoid it. But there might be hundreds or
> thousands of applications running on the machine, and it does not sound
> feasible to have every single application set a mempolicy to avoid it.

We have the cpuset cgroup controller to help here (see the sketch below).

> So I think we still need a default allocation node mask. The default
> value may include all nodes or just the DRAM nodes, but it should be
> overridable by the user globally, not only on a per-process basis.
>
> Due to the performance disparity, our current use cases treat PMEM as
> second-tier memory for demoting cold pages or for binding applications
> that are not sensitive to memory access latency (this is the reason for
> inventing a new mempolicy), even though it is a NUMA node.

If the performance sucks that badly then do not use the pmem as NUMA,
really. There are certainly other ways to export the pmem storage. Use
it as fast swap storage, or work on a swap caching mechanism that still
allows much faster access than slow swap storage. But do not abuse the
NUMA interface while breaking some of its long established semantics.

--
Michal Hocko
SUSE Labs
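[Editor's sketch of the cpuset approach mentioned above. It assumes a
cgroup v1 cpuset hierarchy mounted at /sys/fs/cgroup/cpuset and that
node 0 is the DRAM node to keep; PMEM nodes are simply left out of
cpuset.mems. The cgroup name (dram_only), the CPU range, and the pid
taken from the command line are hypothetical placeholders, not anything
from this thread. The point is that one global (or per-service) cpuset
keeps a whole group of tasks off the PMEM node without modifying any
application to call set_mempolicy().]

/*
 * Hypothetical sketch: restrict a group of tasks to DRAM nodes with the
 * cpuset cgroup controller instead of per-process mempolicy.
 * Assumptions: cgroup v1 cpuset mounted at /sys/fs/cgroup/cpuset,
 * node 0 is DRAM, and the target pid is given as argv[1].
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

static int write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return -1;
        }
        fprintf(f, "%s\n", val);
        return fclose(f);
}

int main(int argc, char **argv)
{
        const char *dir = "/sys/fs/cgroup/cpuset/dram_only";

        if (argc != 2) {
                fprintf(stderr, "usage: %s <pid>\n", argv[0]);
                return EXIT_FAILURE;
        }

        if (mkdir(dir, 0755) && errno != EEXIST) {
                perror(dir);
                return EXIT_FAILURE;
        }

        /*
         * A cpuset needs both cpuset.cpus and cpuset.mems populated
         * before tasks can be attached. "0-63" is an assumed CPU range;
         * only the mems restriction (node 0, DRAM only) matters here.
         */
        if (write_str("/sys/fs/cgroup/cpuset/dram_only/cpuset.cpus", "0-63") ||
            write_str("/sys/fs/cgroup/cpuset/dram_only/cpuset.mems", "0"))
                return EXIT_FAILURE;

        /* Page allocations of the attached task now stay on node 0. */
        return write_str("/sys/fs/cgroup/cpuset/dram_only/tasks", argv[1]) ?
                EXIT_FAILURE : EXIT_SUCCESS;
}

[In practice the same restriction is usually applied from a shell or an
init/service manager by writing the allowed node list into cpuset.mems;
cgroup v2 exposes the equivalent knob under the unified hierarchy.]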