Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Wed, 27 Mar 2019 13:35:20 -0700

On Wed, Mar 27, 2019 at 10:34:11AM -0700, Dan Williams wrote:
> On Wed, Mar 27, 2019 at 2:01 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> > No, Linux NUMA implementation makes all numa nodes available by default
> > and provides an API to opt-in for more fine tuning. What you are
> > suggesting goes against that semantic and I am asking why. How is pmem
> > NUMA node any different from any any other distant node in principle?
> 
> Agree. It's just another NUMA node and shouldn't be special cased.
> Userspace policy can choose to avoid it, but typical node distance
> preference should otherwise let the kernel fall back to it as
> additional memory pressure relief for "near" memory.

I think this is sort of true, but sort of different.  These are
essentially CPU-less nodes; there is no CPU for which they are
fast memory.  Yes, they're further from some CPUs than from others.
I have never paid attention to how Linux treats CPU-less memory nodes,
but it would make sense to me if we don't default to allocating from
remote nodes.  And treating pmem nodes as being remote from all CPUs
makes a certain amount of sense to me.

eg on a four CPU-socket system, consider this as being

pmem1 --- node1 --- node2 --- pmem2
            |   \ /   |
            |    X    |
            |   / \   |
pmem3 --- node3 --- node4 --- pmem4

which I could actually see someone building with normal DRAM, and we
should probably handle the same way as pmem; for a process running on
node3, allocate preferentially from node3, then pmem3, then other nodes,
then other pmems.