On 02/11/2023 13:18, Huang, Ying wrote:
> "Zhijian Li (Fujitsu)" <lizhijian@xxxxxxxxxxx> writes:
>
>>> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>>> already.  A node in a higher tier can demote to any node in the lower
>>> tiers.  What's more need to be displayed in nodeX/demotion_nodes?
>>
>> IIRC, they are not the same. memory_tier[number], where the number is shared by
>> the memory using the same memory driver(dax/kmem etc). Not reflect the actual distance
>> across nodes(different distance will be grouped into the same memory_tier).
>> But demotion will only select the nearest nodelist to demote.
>
> In the following patchset, we will use the performance information from
> HMAT to place nodes using the same memory driver into different memory
> tiers.
>
> https://lore.kernel.org/all/20230926060628.265989-1-ying.huang@xxxxxxxxx/

Thanks for the reminder. It seems I have fallen months behind.
I will rebase on it later if this patch is still needed.

>
> The patch is in mm-stable tree.
>
>> Below is an example, node0 node1 are DRAM, node2 node3 are PMEM, but distance to DRAM nodes
>> are different.
>>
>> # numactl -H
>> available: 4 nodes (0-3)
>> node 0 cpus: 0
>> node 0 size: 964 MB
>> node 0 free: 746 MB
>> node 1 cpus: 1
>> node 1 size: 685 MB
>> node 1 free: 455 MB
>> node 2 cpus:
>> node 2 size: 896 MB
>> node 2 free: 897 MB
>> node 3 cpus:
>> node 3 size: 896 MB
>> node 3 free: 896 MB
>> node distances:
>> node   0   1   2   3
>>   0:  10  20  20  25
>>   1:  20  10  25  20
>>   2:  20  25  10  20
>>   3:  25  20  20  10
>> # cat /sys/devices/system/node/node0/demotion_nodes
>> 2
>
> node 2 is only the preferred demotion target.  In fact, memory in node 0
> can be demoted to node 2,3.  Please check demote_folio_list() for
> details.

Have I missed something? At least on the master tree, nd->preferred only
includes the nearest nodes (selected by a specific algorithm), so in the
NUMA topology above, nd->preferred of node0 is node2 only: the distance
from node0 to node3 is 25, greater than the distance to node2 (20).

> 1657         int target_nid = next_demotion_node(pgdat->node_id);

So target_nid cannot be node3, IIUC.
(I cooked up these patches weeks ago, so something may have changed; I
will also take a deeper look later.)

1650 /*
1651  * Take folios on @demote_folios and attempt to demote them to another node.
1652  * Folios which are not demoted are left on @demote_folios.
1653  */
1654 static unsigned int demote_folio_list(struct list_head *demote_folios,
1655                                      struct pglist_data *pgdat)
1656 {
1657         int target_nid = next_demotion_node(pgdat->node_id);
1658         unsigned int nr_succeeded;
1659         nodemask_t allowed_mask;
1660
1661         struct migration_target_control mtc = {
1662                 /*
1663                  * Allocate from 'node', or fail quickly and quietly.
1664                  * When this happens, 'page' will likely just be discarded
1665                  * instead of migrated.
1666                  */
1667                 .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
1668                         __GFP_NOMEMALLOC | GFP_NOWAIT,
1669                 .nid = target_nid,
1670                 .nmask = &allowed_mask
1671         };
1672
1673         if (list_empty(demote_folios))
1674                 return 0;
1675
1676         if (target_nid == NUMA_NO_NODE)
1677                 return 0;
1678
1679         node_get_allowed_targets(pgdat, &allowed_mask);
1680
1681         /* Demotion ignores all cpuset and mempolicy settings */
1682         migrate_pages(demote_folios, alloc_demote_folio, NULL,
1683                       (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
1684                       &nr_succeeded);

>
> --
> Best Regards,
> Huang, Ying
>
>> # cat /sys/devices/system/node/node1/demotion_nodes
>> 3
>> # cat /sys/devices/virtual/memory_tiering/memory_tier22/nodelist
>> 2-3
>>
>> Thanks
>> Zhijian
>>
>> (I hate the outlook native reply composition format.)
>> ________________________________________
>> From: Huang, Ying <ying.huang@xxxxxxxxx>
>> Sent: Thursday, November 2, 2023 11:17
>> To: Li, Zhijian/李 智坚
>> Cc: Andrew Morton; Greg Kroah-Hartman; rafael@xxxxxxxxxx; linux-mm@xxxxxxxxx; Gotou, Yasunori/五島 康文; linux-kernel@xxxxxxxxxxxxxxx
>> Subject: Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys infterface
>>
>> Li Zhijian <lizhijian@xxxxxxxxxxx> writes:
>>
>>> It shows the demotion target nodes of a node. Export this information to
>>> user directly.
>>>
>>> Below is an example where node0 node1 are DRAM, node3 is a PMEM node.
>>> - Before PMEM is online, no demotion_nodes for node0 and node1.
>>> $ cat /sys/devices/system/node/node0/demotion_nodes
>>>  <show nothing>
>>> - After node3 is online as kmem
>>> $ daxctl reconfigure-device --mode=system-ram --no-online dax0.0 && daxctl online-memory dax0.0
>>> [
>>>   {
>>>     "chardev":"dax0.0",
>>>     "size":1054867456,
>>>     "target_node":3,
>>>     "align":2097152,
>>>     "mode":"system-ram",
>>>     "online_memblocks":0,
>>>     "total_memblocks":7
>>>   }
>>> ]
>>> $ cat /sys/devices/system/node/node0/demotion_nodes
>>> 3
>>> $ cat /sys/devices/system/node/node1/demotion_nodes
>>> 3
>>> $ cat /sys/devices/system/node/node3/demotion_nodes
>>>  <show nothing>
>>
>> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>> already.  A node in a higher tier can demote to any node in the lower
>> tiers.  What's more need to be displayed in nodeX/demotion_nodes?
>>
>> --
>> Best Regards,
>> Huang, Ying
>>
>>> Signed-off-by: Li Zhijian <lizhijian@xxxxxxxxxxx>
>>> ---
>>>  drivers/base/node.c          | 13 +++++++++++++
>>>  include/linux/memory-tiers.h |  6 ++++++
>>>  mm/memory-tiers.c            |  8 ++++++++
>>>  3 files changed, 27 insertions(+)
>>>
>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>> index 493d533f8375..27e8502548a7 100644
>>> --- a/drivers/base/node.c
>>> +++ b/drivers/base/node.c
>>> @@ -7,6 +7,7 @@
>>>  #include <linux/init.h>
>>>  #include <linux/mm.h>
>>>  #include <linux/memory.h>
>>> +#include <linux/memory-tiers.h>
>>>  #include <linux/vmstat.h>
>>>  #include <linux/notifier.h>
>>>  #include <linux/node.h>
>>> @@ -569,11 +570,23 @@ static ssize_t node_read_distance(struct device *dev,
>>>  }
>>>  static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>>>
>>> +static ssize_t demotion_nodes_show(struct device *dev,
>>> +                                  struct device_attribute *attr, char *buf)
>>> +{
>>> +        int ret;
>>> +        nodemask_t nmask = next_demotion_nodes(dev->id);
>>> +
>>> +        ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
>>> +        return ret;
>>> +}
>>> +static DEVICE_ATTR_RO(demotion_nodes);
>>> +
>>>  static struct attribute *node_dev_attrs[] = {
>>>          &dev_attr_meminfo.attr,
>>>          &dev_attr_numastat.attr,
>>>          &dev_attr_distance.attr,
>>>          &dev_attr_vmstat.attr,
>>> +        &dev_attr_demotion_nodes.attr,
>>>          NULL
>>>  };
>>>
>>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>>> index 437441cdf78f..8eb04923f965 100644
>>> --- a/include/linux/memory-tiers.h
>>> +++ b/include/linux/memory-tiers.h
>>> @@ -38,6 +38,7 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
>>>  void clear_node_memory_type(int node, struct memory_dev_type *memtype);
>>>  #ifdef CONFIG_MIGRATION
>>>  int next_demotion_node(int node);
>>> +nodemask_t next_demotion_nodes(int node);
>>>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>>>  bool node_is_toptier(int node);
>>>  #else
>>> @@ -46,6 +47,11 @@ static inline int next_demotion_node(int node)
>>>          return NUMA_NO_NODE;
>>>  }
>>>
>>> +static inline nodemask_t next_demotion_nodes(int node)
>>> +{
>>> +        return NODE_MASK_NONE;
>>> +}
>>> +
>>>  static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>>  {
>>>          *targets = NODE_MASK_NONE;
>>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>>> index 37a4f59d9585..90047f37d98a 100644
>>> --- a/mm/memory-tiers.c
>>> +++ b/mm/memory-tiers.c
>>> @@ -282,6 +282,14 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>>          rcu_read_unlock();
>>>  }
>>>
>>> +nodemask_t next_demotion_nodes(int node)
>>> +{
>>> +        if (!node_demotion)
>>> +                return NODE_MASK_NONE;
>>> +
>>> +        return node_demotion[node].preferred;
>>> +}
>>> +
>>>  /**
>>>   * next_demotion_node() - Get the next node in the demotion path
>>>   * @node: The starting node to lookup the next node
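
The distance argument made earlier in this thread (node0 prefers node2,
node1 prefers node3) can be illustrated with a small userspace sketch.
This is not the kernel's code: the table name node_distance_tbl, the
helper nearest_lower_tier_nodes(), and the use of a plain bitmask in
place of nodemask_t are all assumptions made for illustration. It models
only the "nearest lower-tier node" selection behind nd->preferred, not
the allowed mask that node_get_allowed_targets() fills in.

```c
#include <assert.h>
#include <limits.h>

#define NR_NODES 4

/* Distance table copied from the `numactl -H` output quoted in the thread. */
static const int node_distance_tbl[NR_NODES][NR_NODES] = {
    { 10, 20, 20, 25 },
    { 20, 10, 25, 20 },
    { 20, 25, 10, 20 },
    { 25, 20, 20, 10 },
};

/*
 * Hypothetical helper (not a kernel function): return a bitmask of the
 * nodes in lower_tier_mask that sit at the minimum distance from src.
 * Bit N set means node N is a preferred demotion target.
 */
static unsigned int nearest_lower_tier_nodes(int src, unsigned int lower_tier_mask)
{
    int best = INT_MAX;
    unsigned int preferred = 0;

    assert(src >= 0 && src < NR_NODES);

    for (int nid = 0; nid < NR_NODES; nid++) {
        int d;

        /* Skip nodes that are not in the lower tier. */
        if (!(lower_tier_mask & (1u << nid)))
            continue;

        d = node_distance_tbl[src][nid];
        if (d < best) {
            /* Strictly closer node found: restart the preferred set. */
            best = d;
            preferred = 1u << nid;
        } else if (d == best) {
            /* Equally close node: add it to the preferred set. */
            preferred |= 1u << nid;
        }
    }

    return preferred;
}
```

With node2 and node3 as the lower tier (mask 0xC), the helper returns
only bit 2 for node0 and only bit 3 for node1, which matches the
demotion_nodes values shown in the thread. Whether demotion may still
fall back to the other PMEM node depends on the allowed mask, which this
sketch deliberately leaves out.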