Bharata B Rao <bharata@xxxxxxx> writes:

> On 28-Mar-24 11:05 AM, Huang, Ying wrote:
>> Bharata B Rao <bharata@xxxxxxx> writes:
>>
>>> In order to check how efficiently the existing NUMA balancing
>>> based hot page promotion mechanism can detect hot regions and
>>> promote pages for workloads with large memory footprints, I
>>> wrote and tested a program that allocates a huge amount of
>>> memory but routinely touches only small parts of it.
>>>
>>> This microbenchmark provisions memory on both the DRAM node and the
>>> CXL node. It then divides the entire allocated memory into smaller
>>> chunks and randomly chooses a chunk for generating memory accesses.
>>> Each chunk is then accessed for a fixed number of iterations to
>>> create the notion of hotness. Within each chunk, the individual
>>> pages at 4K granularity are again accessed in random fashion.
>>>
>>> When a chunk is taken up for access in this manner, its pages
>>> can be residing either on DRAM or on CXL. In the latter case, the NUMA
>>> balancing driven hot page promotion logic is expected to detect and
>>> promote the hot pages that reside on CXL.
>>>
>>> The experiment was conducted on a 2P AMD Bergamo system that has
>>> CXL as the 3rd node.
>>>
>>> $ numactl -H
>>> available: 3 nodes (0-2)
>>> node 0 cpus: 0-127,256-383
>>> node 0 size: 128054 MB
>>> node 1 cpus: 128-255,384-511
>>> node 1 size: 128880 MB
>>> node 2 cpus:
>>> node 2 size: 129024 MB
>>> node distances:
>>> node   0   1   2
>>>   0:  10  32  60
>>>   1:  32  10  50
>>>   2: 255 255  10
>>>
>>> It is seen that the number of pages that get promoted is really low,
>>> and the reason happens to be that the NUMA hint fault latency turns
>>> out to be much higher than the hot threshold most of the time. Here
>>> are a few latency and threshold sample values captured from the
>>> should_numa_migrate_memory() routine when the benchmark was run:
>>>
>>> latency    threshold (in ms)
>>> 20620      1125
>>> 56185      1125
>>> 98710      1250
>>> 148871     1375
>>> 182891     1625
>>> 369415     1875
>>> 630745     2000
>>
>> The access latency of your workload is 20s to 630s, which appears too
>> long. Can you try to increase the range of the threshold to deal with
>> that? For example,
>>
>> echo 100000 > /sys/kernel/debug/sched/numa_balancing/hot_threshold_ms
>
> That of course should help. But I was exploring alternatives where the
> notion of hotness can be de-linked from the absolute scanning time to

In fact, only the relative time from scan to hint fault is recorded and
used in the comparison; we have only a limited number of bits for it.

> the extent possible. For large memory workloads where only parts of the
> memory get accessed at once, the scanning time can lag behind the actual
> access time significantly, as the data above shows. Wondering if such
> cases can be addressed without having to be workload-specific.

Does it really matter to promote such cold pages (accessed at intervals
of more than 20s)? And if so, how can we adjust the current algorithm to
cover that? I think that may be possible by extending the threshold
range, and I think we can find some way to extend the range by default
if necessary.

--
Best Regards,
Huang, Ying
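
For context, below is a minimal userspace sketch of the latency-versus-
threshold decision being discussed. It is not the kernel's actual
should_numa_migrate_memory() implementation; the struct, helper, and
output are illustrative, and the sample values are the ones quoted in
the table above.

/*
 * Illustrative sketch only: a page's NUMA hint fault latency is compared
 * against the current hot threshold, and the page is considered for
 * promotion only when the latency falls below the threshold. The real
 * logic lives in should_numa_migrate_memory() in the kernel; the numbers
 * below are the samples quoted in the mail, not kernel defaults.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct sample {
	unsigned long latency_ms;	/* scan-to-hint-fault latency */
	unsigned long threshold_ms;	/* hot threshold at that moment */
};

/* Promote only when the page was accessed "recently enough". */
static bool would_promote(const struct sample *s)
{
	return s->latency_ms < s->threshold_ms;
}

int main(void)
{
	/* Latency/threshold pairs captured from the benchmark run above. */
	static const struct sample samples[] = {
		{  20620, 1125 }, {  56185, 1125 }, {  98710, 1250 },
		{ 148871, 1375 }, { 182891, 1625 }, { 369415, 1875 },
		{ 630745, 2000 },
	};

	for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		printf("latency %6lu ms, threshold %4lu ms -> %s\n",
		       samples[i].latency_ms, samples[i].threshold_ms,
		       would_promote(&samples[i]) ? "promote" : "skip");

	/*
	 * Every sample prints "skip": the observed latencies (20s..630s) are
	 * orders of magnitude above the ~1-2s thresholds, which is why the
	 * suggestion above is to raise hot_threshold_ms so that more of
	 * these accesses fall under the threshold.
	 */
	return 0;
}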