"Yasunori Gotou (Fujitsu)" <y-goto@xxxxxxxxxxx> writes: >> > Hello, >> > >> >> On 02/11/2023 13:45, Huang, Ying wrote: >> >> > Li Zhijian <lizhijian@xxxxxxxxxxx> writes: >> >> > >> >> >> pgdemote_src_*: pages demoted from this node. >> >> >> pgdemote_dst_*: pages demoted to this node. >> >> >> >> >> >> So that we are able to know their demotion per-node stats by checking >> this. >> >> >> >> >> >> In the environment, node0 and node1 are DRAM, node3 is PMEM. >> >> >> >> >> >> Global stats: >> >> >> $ grep -E 'demote' /proc/vmstat >> >> >> pgdemote_src_kswapd 130155 >> >> >> pgdemote_src_direct 113497 >> >> >> pgdemote_src_khugepaged 0 >> >> >> pgdemote_dst_kswapd 130155 >> >> >> pgdemote_dst_direct 113497 >> >> >> pgdemote_dst_khugepaged 0 >> >> >> >> >> >> Per-node stats: >> >> >> $ grep demote /sys/devices/system/node/node0/vmstat >> >> >> pgdemote_src_kswapd 68454 >> >> >> pgdemote_src_direct 83431 >> >> >> pgdemote_src_khugepaged 0 >> >> >> pgdemote_dst_kswapd 0 >> >> >> pgdemote_dst_direct 0 >> >> >> pgdemote_dst_khugepaged 0 >> >> >> >> >> >> $ grep demote /sys/devices/system/node/node1/vmstat >> >> >> pgdemote_src_kswapd 185834 >> >> >> pgdemote_src_direct 30066 >> >> >> pgdemote_src_khugepaged 0 >> >> >> pgdemote_dst_kswapd 0 >> >> >> pgdemote_dst_direct 0 >> >> >> pgdemote_dst_khugepaged 0 >> >> >> >> >> >> $ grep demote /sys/devices/system/node/node3/vmstat >> >> >> pgdemote_src_kswapd 0 >> >> >> pgdemote_src_direct 0 >> >> >> pgdemote_src_khugepaged 0 >> >> >> pgdemote_dst_kswapd 254288 >> >> >> pgdemote_dst_direct 113497 >> >> >> pgdemote_dst_khugepaged 0 >> >> >> >> >> >> From above stats, we know node3 is the demotion destination which >> >> >> one the node0 and node1 will demote to. >> >> > >> >> > Why do we need these information? Do you have some use case? >> >> >> >> I recall our customers have mentioned that they want to know how much >> >> the memory is demoted to the CXL memory device in a specific period. >> > >> > I'll mention about it more. >> > >> > I had a conversation with one of our customers. He expressed a desire >> > for more detailed profile information to analyze the behavior of >> > demotion (and promotion) when his workloads are executed. >> > If the results are not satisfactory for his workloads, he wants to >> > tune his servers for his workloads with these profiles. >> > Additionally, depending on the results, he may want to change his server >> configuration. >> > For example, he may want to buy more expensive DDR memories rather than >> cheaper CXL memory. >> > >> > In my impression, our customers seems to think that CXL memory is NOT as >> reliable as DDR memory yet. >> > Therefore, they want to prepare for the new world that CXL will bring, >> > and want to have a method for the preparation by profiling information as >> much as possible. >> > >> > it this enough for your question? >> >> I want some more detailed information about how these stats are used? >> Why isn't per-node pgdemote_xxx counter enough? > > I rechecked the customer's original request. > > - If a memory area is demoted to a CXL memory node, he wanted to analyze how it affects performance > of their workload, such as latency. He wanted to use CXL Node memory usage as basic > information for the analysis. > > - If he notices that demotion occurs well on a server and CXL memories are used 85% constantly, he > may want to add DDR DRAM or select some other ways to avoid demotion. > (His image is likely Swap free/used.) > IIRC, demotion target is not spread to all of the CXL memory node, right? > Then, he needs to know how CXL memory is occupied by demoted memory. > > If I misunderstand something, or you have any better idea, > please let us know. I'll talk with him again. (It will be next week...) To check CXL memory usage, /proc/PID/numa_maps, /sys/fs/cgroup/CGROUP/memory.numa_stat, and /sys/devices/system/node/nodeN/meminfo can be used for process, cgroup, and NUMA node respectively. Is this enough? -- Best Regards, Huang, Ying >> > >> >> >> >> >> >> >>> mod_node_page_state(NODE_DATA(target_nid), >> >> >>> - PGDEMOTE_KSWAPD + reclaimer_offset(), >> >> nr_succeeded); >> >> >>> + PGDEMOTE_DST_KSWAPD + reclaimer_offset(), >> >> nr_succeeded); >> >> >> >> But if the *target_nid* is only indicate the preferred node, this >> >> accounting maybe not accurate. >> >> [snip]