"Zhijian Li (Fujitsu)" <lizhijian@xxxxxxxxxxx> writes:

> On 02/11/2023 13:18, Huang, Ying wrote:
>> "Zhijian Li (Fujitsu)" <lizhijian@xxxxxxxxxxx> writes:
>>
>>>> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>>>> already.  A node in a higher tier can demote to any node in the lower
>>>> tiers.  What's more need to be displayed in nodeX/demotion_nodes?
>>>
>>> IIRC, they are not the same. memory_tier[number], where the number is shared by
>>> the memory using the same memory driver(dax/kmem etc). Not reflect the actual distance
>>> across nodes(different distance will be grouped into the same memory_tier).
>>> But demotion will only select the nearest nodelist to demote.
>>
>> In the following patchset, we will use the performance information from
>> HMAT to place nodes using the same memory driver into different memory
>> tiers.
>>
>> https://lore.kernel.org/all/20230926060628.265989-1-ying.huang@xxxxxxxxx/
>
> Thanks for your reminder. It seems like I've fallen behind the world by months.
> I will rebase on it later if this patch is still needed.
>
>>
>> The patch is in mm-stable tree.
>>
>>> Below is an example, node0 node1 are DRAM, node2 node3 are PMEM, but distance to DRAM nodes
>>> are different.
>>>
>>> # numactl -H
>>> available: 4 nodes (0-3)
>>> node 0 cpus: 0
>>> node 0 size: 964 MB
>>> node 0 free: 746 MB
>>> node 1 cpus: 1
>>> node 1 size: 685 MB
>>> node 1 free: 455 MB
>>> node 2 cpus:
>>> node 2 size: 896 MB
>>> node 2 free: 897 MB
>>> node 3 cpus:
>>> node 3 size: 896 MB
>>> node 3 free: 896 MB
>>> node distances:
>>> node   0   1   2   3
>>>   0:  10  20  20  25
>>>   1:  20  10  25  20
>>>   2:  20  25  10  20
>>>   3:  25  20  20  10
>>> # cat /sys/devices/system/node/node0/demotion_nodes
>>> 2
>>
>> node 2 is only the preferred demotion target.  In fact, memory in node 0
>> can be demoted to node 2,3.  Please check demote_folio_list() for
>> details.
>
> Have I missed something, at least the on master tree, nd->preferred only include the
> nearest ones(by specific algorithms), so in above numa topology, nd->preferred of
> node0 is node2 only. node0 distance to node3 is 25 greater than to node2(20).
>
>> 1657         int target_nid = next_demotion_node(pgdat->node_id);
>
> So target_nid cannot be node3 IIUC.
>
> (I cooked this patches weeks ago, maybe something has changed, i will also take a deep look later.)
>
> 1650 /*
> 1651  * Take folios on @demote_folios and attempt to demote them to another node.
> 1652  * Folios which are not demoted are left on @demote_folios.
> 1653  */
> 1654 static unsigned int demote_folio_list(struct list_head *demote_folios,
> 1655                                       struct pglist_data *pgdat)
> 1656 {
> 1657         int target_nid = next_demotion_node(pgdat->node_id);
> 1658         unsigned int nr_succeeded;
> 1659         nodemask_t allowed_mask;
> 1660
> 1661         struct migration_target_control mtc = {
> 1662                 /*
> 1663                  * Allocate from 'node', or fail quickly and quietly.
> 1664                  * When this happens, 'page' will likely just be discarded
> 1665                  * instead of migrated.
> 1666                  */
> 1667                 .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
> 1668                         __GFP_NOMEMALLOC | GFP_NOWAIT,
> 1669                 .nid = target_nid,
> 1670                 .nmask = &allowed_mask
> 1671         };
> 1672
> 1673         if (list_empty(demote_folios))
> 1674                 return 0;
> 1675
> 1676         if (target_nid == NUMA_NO_NODE)
> 1677                 return 0;
> 1678
> 1679         node_get_allowed_targets(pgdat, &allowed_mask);
> 1680
> 1681         /* Demotion ignores all cpuset and mempolicy settings */
> 1682         migrate_pages(demote_folios, alloc_demote_folio, NULL,
> 1683                       (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
> 1684                       &nr_succeeded);
>

In alloc_demote_folio(), target_nid is tried first.  If that allocation
fails, any node in allowed_mask is tried.  So target_nid is only the
preferred target, not the only possible one.
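
To make the two-step behaviour concrete, here is a minimal user-space
sketch.  It is not kernel code: preferred_target(), pick_demotion_node(),
lower_tier_mask and the free_pages[] array are all made up for
illustration, and the real logic lives in next_demotion_node(),
node_get_allowed_targets() and alloc_demote_folio().  It assumes the
distance table from the numactl output quoted above, nodes 2 and 3 as the
only lower-tier nodes, and a single preferred node per source node rather
than the nodemask the kernel keeps.

```c
#define NR_NODES 4
#define NO_NODE  (-1)

/* Distance table copied from the `numactl -H` output quoted in this thread. */
static const int distance[NR_NODES][NR_NODES] = {
	{10, 20, 20, 25},
	{20, 10, 25, 20},
	{20, 25, 10, 20},
	{25, 20, 20, 10},
};

/*
 * Assumption: nodes 2 and 3 (PMEM) form the only lower tier, so this mask
 * doubles as the "allowed" demotion mask for the DRAM nodes 0 and 1.
 */
static const unsigned int lower_tier_mask = (1u << 2) | (1u << 3);

/*
 * Hypothetical helper: the nearest lower-tier node for @src, or NO_NODE if
 * @src is itself in the lower tier. Ties go to the lowest node id here,
 * which is a simplification of this sketch.
 */
int preferred_target(int src)
{
	int best = NO_NODE;

	if (lower_tier_mask & (1u << src))
		return NO_NODE;

	for (int n = 0; n < NR_NODES; n++) {
		if (!(lower_tier_mask & (1u << n)))
			continue;
		if (best == NO_NODE || distance[src][n] < distance[src][best])
			best = n;
	}
	return best;
}

/*
 * Hypothetical helper modelling the two-step allocation: try the preferred
 * node first, then fall back to any allowed node that still has free pages.
 */
int pick_demotion_node(int src, const int free_pages[NR_NODES])
{
	int target = preferred_target(src);

	if (target == NO_NODE)
		return NO_NODE;
	if (free_pages[target] > 0)
		return target;

	for (int n = 0; n < NR_NODES; n++)
		if ((lower_tier_mask & (1u << n)) && free_pages[n] > 0)
			return n;

	return NO_NODE;
}
```

With free pages on node 2, pick_demotion_node(0, ...) picks node 2.  Once
node 2 has none left, it falls back to node 3, which is how memory in
node 0 can end up on either node.
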

--
Best Regards,
Huang, Ying

>>
>>> # cat /sys/devices/system/node/node1/demotion_nodes
>>> 3
>>> # cat /sys/devices/virtual/memory_tiering/memory_tier22/nodelist
>>> 2-3
>>>
>>> Thanks
>>> Zhijian
>>>
>>> (I hate the outlook native reply composition format.)
>>> ________________________________________
>>> From: Huang, Ying <ying.huang@xxxxxxxxx>
>>> Sent: Thursday, November 2, 2023 11:17
>>> To: Li, Zhijian/李 智坚
>>> Cc: Andrew Morton; Greg Kroah-Hartman; rafael@xxxxxxxxxx; linux-mm@xxxxxxxxx; Gotou, Yasunori/五島 康文; linux-kernel@xxxxxxxxxxxxxxx
>>> Subject: Re: [PATCH RFC 1/4] drivers/base/node: Add demotion_nodes sys infterface
>>>
>>> Li Zhijian <lizhijian@xxxxxxxxxxx> writes:
>>>
>>>> It shows the demotion target nodes of a node. Export this information to
>>>> user directly.
>>>>
>>>> Below is an example where node0 node1 are DRAM, node3 is a PMEM node.
>>>> - Before PMEM is online, no demotion_nodes for node0 and node1.
>>>> $ cat /sys/devices/system/node/node0/demotion_nodes
>>>>  <show nothing>
>>>> - After node3 is online as kmem
>>>> $ daxctl reconfigure-device --mode=system-ram --no-online dax0.0 && daxctl online-memory dax0.0
>>>> [
>>>>   {
>>>>     "chardev":"dax0.0",
>>>>     "size":1054867456,
>>>>     "target_node":3,
>>>>     "align":2097152,
>>>>     "mode":"system-ram",
>>>>     "online_memblocks":0,
>>>>     "total_memblocks":7
>>>>   }
>>>> ]
>>>> $ cat /sys/devices/system/node/node0/demotion_nodes
>>>> 3
>>>> $ cat /sys/devices/system/node/node1/demotion_nodes
>>>> 3
>>>> $ cat /sys/devices/system/node/node3/demotion_nodes
>>>>  <show nothing>
>>>
>>> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>>> already.  A node in a higher tier can demote to any node in the lower
>>> tiers.  What's more need to be displayed in nodeX/demotion_nodes?
>>>
>>> --
>>> Best Regards,
>>> Huang, Ying
>>>
>>>> Signed-off-by: Li Zhijian <lizhijian@xxxxxxxxxxx>
>>>> ---
>>>>  drivers/base/node.c          | 13 +++++++++++++
>>>>  include/linux/memory-tiers.h |  6 ++++++
>>>>  mm/memory-tiers.c            |  8 ++++++++
>>>>  3 files changed, 27 insertions(+)
>>>>
>>>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>>>> index 493d533f8375..27e8502548a7 100644
>>>> --- a/drivers/base/node.c
>>>> +++ b/drivers/base/node.c
>>>> @@ -7,6 +7,7 @@
>>>>  #include <linux/init.h>
>>>>  #include <linux/mm.h>
>>>>  #include <linux/memory.h>
>>>> +#include <linux/memory-tiers.h>
>>>>  #include <linux/vmstat.h>
>>>>  #include <linux/notifier.h>
>>>>  #include <linux/node.h>
>>>> @@ -569,11 +570,23 @@ static ssize_t node_read_distance(struct device *dev,
>>>>  }
>>>>  static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>>>>
>>>> +static ssize_t demotion_nodes_show(struct device *dev,
>>>> +                                   struct device_attribute *attr, char *buf)
>>>> +{
>>>> +        int ret;
>>>> +        nodemask_t nmask = next_demotion_nodes(dev->id);
>>>> +
>>>> +        ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
>>>> +        return ret;
>>>> +}
>>>> +static DEVICE_ATTR_RO(demotion_nodes);
>>>> +
>>>>  static struct attribute *node_dev_attrs[] = {
>>>>          &dev_attr_meminfo.attr,
>>>>          &dev_attr_numastat.attr,
>>>>          &dev_attr_distance.attr,
>>>>          &dev_attr_vmstat.attr,
>>>> +        &dev_attr_demotion_nodes.attr,
>>>>          NULL
>>>>  };
>>>>
>>>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>>>> index 437441cdf78f..8eb04923f965 100644
>>>> --- a/include/linux/memory-tiers.h
>>>> +++ b/include/linux/memory-tiers.h
>>>> @@ -38,6 +38,7 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
>>>>  void clear_node_memory_type(int node, struct memory_dev_type *memtype);
>>>>  #ifdef CONFIG_MIGRATION
>>>>  int next_demotion_node(int node);
>>>> +nodemask_t next_demotion_nodes(int node);
>>>>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>>>>  bool node_is_toptier(int node);
>>>>  #else
>>>> @@ -46,6 +47,11 @@ static inline int next_demotion_node(int node)
>>>>          return NUMA_NO_NODE;
>>>>  }
>>>>
>>>> +static inline nodemask_t next_demotion_nodes(int node)
>>>> +{
>>>> +        return NODE_MASK_NONE;
>>>> +}
>>>> +
>>>>  static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>>>  {
>>>>          *targets = NODE_MASK_NONE;
>>>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>>>> index 37a4f59d9585..90047f37d98a 100644
>>>> --- a/mm/memory-tiers.c
>>>> +++ b/mm/memory-tiers.c
>>>> @@ -282,6 +282,14 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>>>          rcu_read_unlock();
>>>>  }
>>>>
>>>> +nodemask_t next_demotion_nodes(int node)
>>>> +{
>>>> +        if (!node_demotion)
>>>> +                return NODE_MASK_NONE;
>>>> +
>>>> +        return node_demotion[node].preferred;
>>>> +}
>>>> +
>>>>  /**
>>>>   * next_demotion_node() - Get the next node in the demotion path
>>>>   * @node: The starting node to lookup the next node