On Wed, Mar 27, 2019 at 10:34:11AM -0700, Dan Williams wrote:
> On Wed, Mar 27, 2019 at 2:01 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> > No, Linux NUMA implementation makes all numa nodes available by default
> > and provides an API to opt-in for more fine tuning. What you are
> > suggesting goes against that semantic and I am asking why. How is pmem
> > NUMA node any different from any other distant node in principle?
>
> Agree. It's just another NUMA node and shouldn't be special cased.
> Userspace policy can choose to avoid it, but typical node distance
> preference should otherwise let the kernel fall back to it as
> additional memory pressure relief for "near" memory.

I think this is sort of true, but sort of different.  These are
essentially CPU-less nodes; there is no CPU for which they are
fast memory.  Yes, they're further from some CPUs than from others.
I have never paid attention to how Linux treats CPU-less memory nodes,
but it would make sense to me if we don't default to allocating from
remote nodes.  And treating pmem nodes as being remote from all CPUs
makes a certain amount of sense to me.

e.g. on a four CPU-socket system, consider this as being

pmem1 --- node1 --- node2 --- pmem2
            |  \   /  |
            |    X    |
            |  /   \  |
pmem3 --- node3 --- node4 --- pmem4

which I could actually see someone building with normal DRAM, and we
should probably handle the same way as pmem; for a process running on
node3, allocate preferentially from node3, then pmem3, then other nodes,
then other pmems.
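
To make the ordering above concrete, here is a rough sketch (plain Python,
not kernel code) that derives a fallback order for the diagram's topology by
sorting on distance.  The node names come from the diagram; the distance
values and the helper names distance() and fallback_order() are made up for
illustration and are not taken from any real SLIT table or kernel API.

```python
# Hypothetical topology from the diagram: four fully connected CPU sockets,
# each with one directly attached CPU-less pmem node.
# All distance values below are invented for illustration only.
LOCAL, PMEM_LOCAL, DRAM_REMOTE, PMEM_REMOTE = 10, 17, 21, 28

NODES = ["node1", "node2", "node3", "node4",
         "pmem1", "pmem2", "pmem3", "pmem4"]


def distance(src, dst):
    """Distance from CPU node `src` to memory node `dst` (assumed values)."""
    if src == dst:
        return LOCAL
    if dst.startswith("pmem"):
        # pmemN hangs off nodeN; every other pmem is one more hop away.
        return PMEM_LOCAL if dst[4:] == src[4:] else PMEM_REMOTE
    return DRAM_REMOTE


def fallback_order(src):
    """Sort all nodes by distance from `src`; ties keep their NODES order."""
    return sorted(NODES, key=lambda dst: distance(src, dst))


print(fallback_order("node3"))
```

With these assumed distances, a process on node3 gets node3 first, then
pmem3, then the other three DRAM nodes, then the other three pmems.  No
special casing of pmem is needed in the sort itself; the ordering falls out
of the distances, provided they describe pmem as further away than any DRAM
node that is not its own socket.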