"Yasunori Gotou (Fujitsu)" <y-goto@xxxxxxxxxxx> writes: > Hello, > >> Li Zhijian <lizhijian@xxxxxxxxxxx> writes: >> >> > Hi Ying >> > >> > I need to pick up this thread/patch again. >> > >> >> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist >> >> already. A node in a higher tier can demote to any node in the lower >> >> tiers. What's more need to be displayed in nodeX/demotion_nodes? >> >> >> > >> > Yes, it's believed that >> > /sys/devices/virtual/memory_tiering/memory_tierN/nodelist >> > are intended to show nodes in memory_tierN. But IMHO, it's not enough, >> > especially for the preferred demotion node(s). >> > >> > Currently, when a demotion occurs, it will prioritize selecting a node >> > from the preferred nodes as the destination node for the demotion. If >> > the preferred nodes does not meet the requirements, it will try from >> > all the lower memory tier nodes until it finds a suitable demotion >> > destination node or ultimately fails. >> > >> > However, currently it only lists the nodes of each tier. If the >> > administrators want to know all the possible demotion destinations for >> > a given node, they need to calculate it themselves: >> > Step 1, find the memory tier where the given node is located Step 2, >> > list all nodes under all its lower tiers >> > >> > It will be even more difficult to know the preferred nodes which >> > depend on more factors, distance etc. For the following example, we >> > may have 6 nodes splitting into three memory tiers. 
>> >
>> > For an emulated HMAT NUMA topology example:
>> >> $ numactl -H
>> >> available: 6 nodes (0-5)
>> >> node 0 cpus: 0
>> >> node 0 size: 1974 MB
>> >> node 0 free: 1767 MB
>> >> node 1 cpus: 1
>> >> node 1 size: 1694 MB
>> >> node 1 free: 1454 MB
>> >> node 2 cpus:
>> >> node 2 size: 896 MB
>> >> node 2 free: 896 MB
>> >> node 3 cpus:
>> >> node 3 size: 896 MB
>> >> node 3 free: 896 MB
>> >> node 4 cpus:
>> >> node 4 size: 896 MB
>> >> node 4 free: 896 MB
>> >> node 5 cpus:
>> >> node 5 size: 896 MB
>> >> node 5 free: 896 MB
>> >> node distances:
>> >> node   0   1   2   3   4   5
>> >>   0:  10  31  21  41  21  41
>> >>   1:  31  10  41  21  41  21
>> >>   2:  21  41  10  51  21  51
>> >>   3:  31  21  51  10  51  21
>> >>   4:  21  41  21  51  10  51
>> >>   5:  31  21  51  21  51  10
>> >> $ cat memory_tier4/nodelist
>> >> 0-1
>> >> $ cat memory_tier12/nodelist
>> >> 2,5
>> >> $ cat memory_tier54/nodelist
>> >> 3-4
>> >
>> > For the above topology, memory-tier will build the demotion path
>> > for each node like this:
>> >   node[0].preferred = 2
>> >   node[0].demotion_targets = 2-5
>> >   node[1].preferred = 5
>> >   node[1].demotion_targets = 2-5
>> >   node[2].preferred = 4
>> >   node[2].demotion_targets = 3-4
>> >   node[3].preferred = <empty>
>> >   node[3].demotion_targets = <empty>
>> >   node[4].preferred = <empty>
>> >   node[4].demotion_targets = <empty>
>> >   node[5].preferred = 3
>> >   node[5].demotion_targets = 3-4
>> >
>> > But this demotion path is not explicitly known to the
>> > administrator.  And from the feedback of our customers, they also
>> > think it is helpful to know the demotion path built by the kernel
>> > to understand the demotion behaviors.
>> >
>> > So I think we should have 2 new interfaces for each node:
>> >
>> > /sys/devices/system/node/nodeN/demotion_allowed_nodes
>> > /sys/devices/system/node/nodeN/demotion_preferred_nodes
>> >
>> > I value your opinion, and I'd like to know what you think about it...
>>
>> Per my understanding, we will not expose everything inside the kernel
>> to user space.
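[As a cross-check of the demotion paths listed above, the tier walk can be sketched in user-space Python.  This is a sketch, not kernel code; the "preferred = nearest node(s) in the next lower tier by NUMA distance" rule is an assumption based on the behavior described in this thread.]

```python
# Tiers ordered from highest (fastest) to lowest, taken from the
# memory_tier4/12/54 nodelist values in the example above.
tiers = [{0, 1}, {2, 5}, {3, 4}]

# Distance matrix from the `numactl -H` output above.
dist = [
    [10, 31, 21, 41, 21, 41],
    [31, 10, 41, 21, 41, 21],
    [21, 41, 10, 51, 21, 51],
    [31, 21, 51, 10, 51, 21],
    [21, 41, 21, 51, 10, 51],
    [31, 21, 51, 21, 51, 10],
]

def demotion_targets(node):
    """Return (preferred, allowed) target sets for a node."""
    # Step 1: find the tier containing the node.
    idx = next(i for i, t in enumerate(tiers) if node in t)
    lower = tiers[idx + 1:]
    if not lower:
        # Lowest tier: no demotion possible.
        return set(), set()
    # Step 2: allowed targets = union of all lower tiers.
    allowed = set().union(*lower)
    # Assumed preference rule: closest node(s) in the next lower tier.
    best = min(dist[node][t] for t in lower[0])
    preferred = {t for t in lower[0] if dist[node][t] == best}
    return preferred, allowed
```

Running this against the example topology reproduces the node[N].preferred / node[N].demotion_targets listing quoted above.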
>> For page placement in a tiered memory system, demotion is just a
>> part of the story.  For example, if the DRAM of a system becomes
>> full, new page allocation will fall back to the CXL memory.  Have we
>> exposed the default page allocation fallback order to user space?
>
> In extreme terms, users want to analyze all the memory behaviors of
> memory management while executing their workload, and want to trace
> ALL of them if possible.  Of course, that is impossible due to the
> heavy load, so users want other ways as a compromise.  Our request,
> the demotion target information, is just one of them.
>
> In my impression, users worry about the impact of a CXL memory device
> on their workload, and want to have a way to understand that impact.
> If they find there is no information to relieve their anxiety, they
> may avoid buying CXL memory.
>
> In addition, our support team also needs clues to solve users'
> performance problems.  Even if new page allocation falls back to the
> CXL memory, we need to explain why it happened, for accountability.

I guess /proc/<PID>/numa_maps and
/sys/fs/cgroup/<CGNAME>/memory.numa_stat may help to understand the
system behavior.

--
Best Regards,
Huang, Ying

>>
>> All in all, in my opinion, we should expose as little as possible to
>> user space because we need to maintain the ABI forever.
>
> I can understand there is a compatibility problem with our proposal,
> and the kernel may change its logic in the future.  This is a
> tug-of-war between kernel developers and users or support engineers.
> I suppose it often occurs in many places...
>
> Hmm... I hope there is a new idea to resolve this situation even if
> our proposal is rejected...  Anyone?
>
> Thanks,
> ----
> Yasunori Goto
>
>>
>> --
>> Best Regards,
>> Huang, Ying
>>
>> >
>> > On 02/11/2023 11:17, Huang, Ying wrote:
>> >> Li Zhijian <lizhijian@xxxxxxxxxxx> writes:
>> >>
>> >>> It shows the demotion target nodes of a node.  Export this
>> >>> information to the user directly.
>> >>>
>> >>> Below is an example where node0 and node1 are DRAM, node3 is a
>> >>> PMEM node.
>> >>> - Before PMEM is online, no demotion_nodes for node0 and node1.
>> >>>   $ cat /sys/devices/system/node/node0/demotion_nodes
>> >>>   <show nothing>
>> >>> - After node3 is online as kmem
>> >>>   $ daxctl reconfigure-device --mode=system-ram --no-online dax0.0 &&
>> >>>     daxctl online-memory dax0.0
>> >>>   [
>> >>>     {
>> >>>       "chardev":"dax0.0",
>> >>>       "size":1054867456,
>> >>>       "target_node":3,
>> >>>       "align":2097152,
>> >>>       "mode":"system-ram",
>> >>>       "online_memblocks":0,
>> >>>       "total_memblocks":7
>> >>>     }
>> >>>   ]
>> >>>   $ cat /sys/devices/system/node/node0/demotion_nodes
>> >>>   3
>> >>>   $ cat /sys/devices/system/node/node1/demotion_nodes
>> >>>   3
>> >>>   $ cat /sys/devices/system/node/node3/demotion_nodes
>> >>>   <show nothing>
>> >>
>> >> We have /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
>> >> already.  A node in a higher tier can demote to any node in the
>> >> lower tiers.  What more needs to be displayed in
>> >> nodeX/demotion_nodes?
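[For scripting against the nodelist and proposed demotion_nodes files shown above, the kernel's bitmap-list output format ("%*pbl": e.g. "0-1", "2,5", "3-4", or an empty string when the mask is empty) can be parsed with a few lines of Python.  A hedged sketch, not an official parser:]

```python
def parse_nodelist(text):
    """Return the set of node IDs encoded in a kernel node-list string."""
    nodes = set()
    text = text.strip()
    if not text:
        return nodes  # the "<show nothing>" case: empty nodemask
    for part in text.split(","):
        if "-" in part:
            # A range like "2-5" is inclusive on both ends.
            lo, hi = part.split("-")
            nodes.update(range(int(lo), int(hi) + 1))
        else:
            nodes.add(int(part))
    return nodes
```

For example, `parse_nodelist(open(".../node0/demotion_nodes").read())` would yield the demotion target set for node0.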
>> >>
>> >> --
>> >> Best Regards,
>> >> Huang, Ying
>> >>
>> >>> Signed-off-by: Li Zhijian <lizhijian@xxxxxxxxxxx>
>> >>> ---
>> >>>  drivers/base/node.c          | 13 +++++++++++++
>> >>>  include/linux/memory-tiers.h |  6 ++++++
>> >>>  mm/memory-tiers.c            |  8 ++++++++
>> >>>  3 files changed, 27 insertions(+)
>> >>>
>> >>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> >>> index 493d533f8375..27e8502548a7 100644
>> >>> --- a/drivers/base/node.c
>> >>> +++ b/drivers/base/node.c
>> >>> @@ -7,6 +7,7 @@
>> >>>  #include <linux/init.h>
>> >>>  #include <linux/mm.h>
>> >>>  #include <linux/memory.h>
>> >>> +#include <linux/memory-tiers.h>
>> >>>  #include <linux/vmstat.h>
>> >>>  #include <linux/notifier.h>
>> >>>  #include <linux/node.h>
>> >>> @@ -569,11 +570,23 @@ static ssize_t node_read_distance(struct device *dev,
>> >>>  }
>> >>>  static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>> >>>
>> >>> +static ssize_t demotion_nodes_show(struct device *dev,
>> >>> +				   struct device_attribute *attr, char *buf)
>> >>> +{
>> >>> +	int ret;
>> >>> +	nodemask_t nmask = next_demotion_nodes(dev->id);
>> >>> +
>> >>> +	ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
>> >>> +	return ret;
>> >>> +}
>> >>> +static DEVICE_ATTR_RO(demotion_nodes);
>> >>> +
>> >>>  static struct attribute *node_dev_attrs[] = {
>> >>>  	&dev_attr_meminfo.attr,
>> >>>  	&dev_attr_numastat.attr,
>> >>>  	&dev_attr_distance.attr,
>> >>>  	&dev_attr_vmstat.attr,
>> >>> +	&dev_attr_demotion_nodes.attr,
>> >>>  	NULL
>> >>>  };
>> >>>
>> >>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> >>> index 437441cdf78f..8eb04923f965 100644
>> >>> --- a/include/linux/memory-tiers.h
>> >>> +++ b/include/linux/memory-tiers.h
>> >>> @@ -38,6 +38,7 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
>> >>>  void clear_node_memory_type(int node, struct memory_dev_type *memtype);
>> >>>  #ifdef CONFIG_MIGRATION
>> >>>  int next_demotion_node(int node);
>> >>> +nodemask_t next_demotion_nodes(int node);
>> >>>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>> >>>  bool node_is_toptier(int node);
>> >>>  #else
>> >>> @@ -46,6 +47,11 @@ static inline int next_demotion_node(int node)
>> >>>  	return NUMA_NO_NODE;
>> >>>  }
>> >>>
>> >>> +static inline nodemask_t next_demotion_nodes(int node)
>> >>> +{
>> >>> +	return NODE_MASK_NONE;
>> >>> +}
>> >>> +
>> >>>  static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>> >>>  {
>> >>>  	*targets = NODE_MASK_NONE;
>> >>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> >>> index 37a4f59d9585..90047f37d98a 100644
>> >>> --- a/mm/memory-tiers.c
>> >>> +++ b/mm/memory-tiers.c
>> >>> @@ -282,6 +282,14 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>> >>>  	rcu_read_unlock();
>> >>>  }
>> >>>
>> >>> +nodemask_t next_demotion_nodes(int node)
>> >>> +{
>> >>> +	if (!node_demotion)
>> >>> +		return NODE_MASK_NONE;
>> >>> +
>> >>> +	return node_demotion[node].preferred;
>> >>> +}
>> >>> +
>> >>>  /**
>> >>>   * next_demotion_node() - Get the next node in the demotion path
>> >>>   * @node: The starting node to lookup the next node
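[The selection step that next_demotion_node(), quoted at the end of the patch above, performs at demotion time can be mimicked in user space roughly as follows.  This is a sketch under assumptions: the preferred masks are taken from the example topology earlier in the thread, and the pick from among equally preferred targets is modeled as a random choice; exact kernel behavior may differ.]

```python
import random

NUMA_NO_NODE = -1  # mirrors the kernel's "no target" return value

# Preferred demotion masks from the example topology in this thread.
preferred = {0: {2}, 1: {5}, 2: {4}, 3: set(), 4: set(), 5: {3}}

def next_demotion_node(node):
    """Pick one demotion target from the node's preferred mask."""
    targets = preferred.get(node, set())
    if not targets:
        # Lowest tier (or unknown node): nowhere to demote.
        return NUMA_NO_NODE
    # Spread demotions across equally preferred targets.
    return random.choice(sorted(targets))
```

With single-node preferred masks, as in the example, the choice is deterministic: node 0 always demotes toward node 2, and nodes 3 and 4 return NUMA_NO_NODE.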