Wei Xu <weixugc@xxxxxxxxxx> writes:

[...]

>> >
>> >
>> > Tiering Hierarchy Initialization
>> > `=============================='
>> >
>> > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>> >
>> > A device driver can remove its memory nodes from the top tier, e.g.
>> > a dax driver can remove PMEM nodes from the top tier.
>>
>> With the topology built by firmware we should not need this.

I agree that in an ideal world the hierarchy should be built by firmware based
on something like the HMAT. But I also think being able to override this will
be useful in getting there. Therefore a way of overriding the generated
hierarchy would be good, either via sysfs or a kernel boot parameter if we
don't want to commit to a particular user interface now.

However, I'm less sure that letting device drivers override this is a good
idea. How, for example, would a GPU driver make sure its node is in the top
tier? By moving every node that the driver does not know about out of
N_TOPTIER_MEMORY? That could get messy if, say, there were two drivers that
both wanted their node to be in the top tier.

> I agree. But before we have such a firmware, the kernel needs to do
> its best to initialize memory tiers.
>
> Given that we know PMEM is slower than DRAM, but a dax device might
> not be PMEM, a better place to set the tier for PMEM nodes can be the
> ACPI code, e.g. acpi_numa_memory_affinity_init() where we can examine
> the ACPI_SRAT_MEM_NON_VOLATILE bit.
>
>> >
>> > The kernel builds the memory tiering hierarchy and per-node demotion
>> > order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
>> > best distance nodes in the next lower tier are assigned to
>> > node_demotion[N].preferred and all the nodes in the next lower tier
>> > are assigned to node_demotion[N].allowed.
>>
>> I'm not sure whether it should be allowed to demote to multiple lower
>> tiers.  But it is totally fine to *NOT* allow it at the moment.  Once we
>> figure out a good way to define demotion targets, it could be extended
>> to support this easily.
>
> You mean to only support MAX_TIERS=2 for now.  I am fine with that.
> There can be systems with 3 tiers, e.g. GPU -> DRAM -> PMEM, but it is
> not clear yet whether we want to enable transparent memory tiering to
> all the 3 tiers on such systems.

At some point I think we will need to deal with 3 tiers, but I'd be OK with
limiting it to 2 for now if it makes things simpler.

 - Alistair

>> >
>> > node_demotion[N].preferred can be empty if no preferred demotion node
>> > is available for node N.
>> >
>> > If the userspace overrides the tiers via the memory_tiers sysfs
>> > interface, the kernel then only rebuilds the per-node demotion order
>> > accordingly.
>> >
>> > Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
>> > memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
>> > node.
>> >
>> >
>> > Memory Allocation for Demotion
>> > `============================'
>> >
>> > When allocating a new demotion target page, both a preferred node
>> > and the allowed nodemask are provided to the allocation function.
>> > The default kernel allocation fallback order is used to allocate the
>> > page from the specified node and nodemask.
>> >
>> > The mempolicy of cpuset, vma and owner task of the source page can
>> > be set to refine the demotion nodemask, e.g. to prevent demotion or
>> > select a particular allowed node as the demotion target.
>> >
>> >
>> > Examples
>> > `======'
>> >
>> > * Example 1:
>> >   Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
>> >
>> >   Node 0 has node 2 as the preferred demotion target and can also
>> >   fallback demotion to node 3.
>> >
>> >   Node 1 has node 3 as the preferred demotion target and can also
>> >   fallback demotion to node 2.
>> >
>> >   Set mempolicy to prevent cross-socket demotion and memory access,
>> >   e.g. cpuset.mems=0,2
>> >
>> > node distances:
>> > node   0    1    2    3
>> >    0  10   20   30   40
>> >    1  20   10   40   30
>> >    2  30   40   10   40
>> >    3  40   30   40   10
>> >
>> > /sys/devices/system/node/memory_tiers
>> > 0-1
>> > 2-3
>> >
>> > N_TOPTIER_MEMORY: 0-1
>> >
>> > node_demotion[]:
>> >   0: [2], [2-3]
>> >   1: [3], [2-3]
>> >   2: [],  []
>> >   3: [],  []
>> >
>> > * Example 2:
>> >   Node 0 & 1 are DRAM nodes.
>> >   Node 2 is a PMEM node and closer to node 0.
>> >
>> >   Node 0 has node 2 as the preferred and only demotion target.
>> >
>> >   Node 1 has no preferred demotion target, but can still demote
>> >   to node 2.
>> >
>> >   Set mempolicy to prevent cross-socket demotion and memory access,
>> >   e.g. cpuset.mems=0,2
>> >
>> > node distances:
>> > node   0    1    2
>> >    0  10   20   30
>> >    1  20   10   40
>> >    2  30   40   10
>> >
>> > /sys/devices/system/node/memory_tiers
>> > 0-1
>> > 2
>> >
>> > N_TOPTIER_MEMORY: 0-1
>> >
>> > node_demotion[]:
>> >   0: [2], [2]
>> >   1: [],  [2]
>> >   2: [],  []
>> >
>> >
>> > * Example 3:
>> >   Node 0 & 1 are DRAM nodes.
>> >   Node 2 is a PMEM node and has the same distance to node 0 & 1.
>> >
>> >   Node 0 has node 2 as the preferred and only demotion target.
>> >
>> >   Node 1 has node 2 as the preferred and only demotion target.
>> >
>> > node distances:
>> > node   0    1    2
>> >    0  10   20   30
>> >    1  20   10   30
>> >    2  30   30   10
>> >
>> > /sys/devices/system/node/memory_tiers
>> > 0-1
>> > 2
>> >
>> > N_TOPTIER_MEMORY: 0-1
>> >
>> > node_demotion[]:
>> >   0: [2], [2]
>> >   1: [2], [2]
>> >   2: [],  []
>> >
>> >
>> > * Example 4:
>> >   Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
>> >
>> >   All nodes are top-tier.
>> >
>> > node distances:
>> > node   0    1    2
>> >    0  10   20   30
>> >    1  20   10   30
>> >    2  30   30   10
>> >
>> > /sys/devices/system/node/memory_tiers
>> > 0-2
>> >
>> > N_TOPTIER_MEMORY: 0-2
>> >
>> > node_demotion[]:
>> >   0: [],  []
>> >   1: [],  []
>> >   2: [],  []
>> >
>> >
>> > * Example 5:
>> >   Node 0 is a DRAM node with CPU.
>> >   Node 1 is a HBM node.
>> >   Node 2 is a PMEM node.
>> >
>> >   With userspace override, node 1 is the top tier and has node 0 as
>> >   the preferred and only demotion target.
>> >
>> >   Node 0 is in the second tier, tier 1, and has node 2 as the
>> >   preferred and only demotion target.
>> >
>> >   Node 2 is in the lowest tier, tier 2, and has no demotion targets.
>> >
>> > node distances:
>> > node   0    1    2
>> >    0  10   21   30
>> >    1  21   10   40
>> >    2  30   40   10
>> >
>> > /sys/devices/system/node/memory_tiers (userspace override)
>> > 1
>> > 0
>> > 2
>> >
>> > N_TOPTIER_MEMORY: 1
>> >
>> > node_demotion[]:
>> >   0: [2], [2]
>> >   1: [0], [0]
>> >   2: [],  []
>> >
>> > -- Wei
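
For anyone who wants to check the node_demotion[] tables in the quoted
examples by hand, here is a rough user-space sketch in Python of the
tier-by-tier construction. It is not kernel code, and the function name and
data layout are made up for illustration. One assumption to flag: the quoted
text only says the "best distance nodes in the next lower tier" become
preferred, which on its own would give node 1 a preferred target in
Example 2. To match that example, the sketch also requires that the
lower-tier node sees the source node as one of its nearest nodes in the
upper tier. That extra rule is inferred from the examples, not stated in the
proposal.

```python
def build_node_demotion(tiers, distance):
    """Return {node: (preferred, allowed)} for every node in `tiers`.

    tiers:    list of node-id lists, top tier first (hypothetical layout).
    distance: distance[a][b] is the NUMA distance from node a to node b.
    """
    demotion = {}
    for level, tier in enumerate(tiers):
        # Only the next lower tier is considered; the last tier has none.
        lower = tiers[level + 1] if level + 1 < len(tiers) else []
        for node in tier:
            preferred = []
            if lower:
                best = min(distance[node][t] for t in lower)
                for target in lower:
                    if distance[node][target] != best:
                        continue
                    # Assumption inferred from Example 2: the target must
                    # also have `node` among its nearest nodes in this tier.
                    nearest_src = min(distance[src][target] for src in tier)
                    if distance[node][target] == nearest_src:
                        preferred.append(target)
            demotion[node] = (preferred, list(lower))
    return demotion


# Example 2 from the quoted proposal: DRAM nodes 0 and 1, PMEM node 2
# closer to node 0.
example2_distance = [
    [10, 20, 30],
    [20, 10, 40],
    [30, 40, 10],
]
print(build_node_demotion([[0, 1], [2]], example2_distance))
# prints {0: ([2], [2]), 1: ([], [2]), 2: ([], [])}
```

Passing the tiers as an explicit list also covers the userspace override
case in Example 5 ([[1], [0], [2]]), since only the per-node demotion order
is rebuilt from whatever tiers are given.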