On 27-Feb-23 1:24 PM, Huang, Ying wrote:
> Thank you very much for detailed data. Can you provide some analysis
> for your data?

The overhead numbers I shared earlier weren't correct: I realized that while obtaining those numbers from function_graph tracing, the trace buffer was silently getting overrun. I had to reduce the number of memory access iterations to ensure that the full trace fits in the buffer. I will summarize the findings based on these new numbers below.

Just to recap: the microbenchmark is run on an AMD Genoa two-node system. The benchmark has two sets of threads (one set affined to each node) accessing two different chunks of memory (chunk size 8G), both of which are initially allocated on the first node. The benchmark touches each page in its chunk iteratively for a fixed number of iterations (384 in the case shown below). The benchmark score is the amount of time it takes to complete the specified number of accesses. (A rough sketch of this access pattern is included after the summary below.)

Here is the data for the benchmark run:

Time taken or overhead (us) for fault, task_work and sched_switch handling:

                            Default          IBS
Fault handling           2875354862      2602455
Task work handling           139023     24008121
Sched switch handling                      37712
Total overhead           2875493885     26648288

Default
-------
                          Total       Min        Max       Avg
do_numa_page         2875354862      0.08     392.13     22.11
task_numa_work           139023      0.14    5365.77    532.66
Total                2875493885

IBS
---
                          Total       Min        Max       Avg
ibs_overflow_handler    2602455      0.14     103.91      1.29
task_ibs_access_work   24008121      0.17     485.09     37.65
hw_access_sched_in        37712      0.15     287.55      1.35
Total                  26648288

                                    Default            IBS
Benchmark score (us)            160171762.0     40323293.0
numa_pages_migrated                 2097220         511791
Overhead per page                      1371             52
Pages migrated per sec                13094          12692
numa_hint_faults_local              2820311         140856
numa_hint_faults                   38589520         652647
hint_faults_local/hint_faults            7%            22%

(Overhead per page here is the total fault/task_work/sched_switch overhead in us divided by numa_pages_migrated.)

Here is the summary:

- In the IBS case, the benchmark completes 75% faster than in the default case. The gain varies with the number of memory access iterations run as part of the benchmark; for 2048 iterations of accesses, I have seen a gain of around 50%.

- The overhead of NUMA balancing (measured as the time spent in fault handling, task_work handling and sched_switch handling) is much higher in the default case than in the IBS case.

- The number of hint faults in the default case is significantly higher than in the IBS case.

- The local hint-fault percentage is much better in the IBS case than in the default case.

- As shown in the graphs posted elsewhere in this mail thread, page migrations start rather slowly in the default case, while the IBS case shows steady migrations right from the start.

- I have also shown (via graphs posted elsewhere in this mail thread) that in the IBS case the benchmark steadily increases its access iterations over time, while in the default case the benchmark makes no forward progress for a long time after an initial increase.

- Early migrations, driven by relevant access samples from IBS, are most probably the main reason for the uplift that the IBS case gets.

- It is consistently seen that in the IBS case the benchmark manages to complete the specified number of accesses even before the entire chunk of memory gets migrated. The early migrations also offset the cost of the remaining remote accesses.

- In the IBS case, we re-program the IBS counters for the incoming task in the sched_switch path. This overhead is seen to be too small to slow down the benchmark.
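For reference, the core of the microbenchmark described above is essentially an access loop of the following shape. This is only a minimal illustrative sketch (one thread per node, a single byte touched per page, libnuma used for placement and affinity); the chunk size, iteration count and initial placement are taken from the description above, but this is not the actual benchmark source:

#define _GNU_SOURCE
#include <numa.h>
#include <pthread.h>
#include <stdlib.h>

#define CHUNK_SIZE	(8UL << 30)	/* 8G chunk per thread */
#define PG_SIZE		4096UL
#define ITERATIONS	384		/* iterations of accesses */

struct targ {
	char *chunk;			/* chunk that this thread touches */
	int node;			/* node that this thread is affined to */
};

static void *access_loop(void *p)
{
	struct targ *t = p;

	/*
	 * Affine the thread to its node. Both chunks are placed on node 0,
	 * so the node 1 thread does only remote accesses until its pages
	 * get migrated (by NUMA balancing or IBS-driven migration).
	 */
	numa_run_on_node(t->node);

	for (int it = 0; it < ITERATIONS; it++)
		for (unsigned long off = 0; off < CHUNK_SIZE; off += PG_SIZE)
			t->chunk[off]++;	/* touch each page */

	return NULL;
}

int main(void)
{
	pthread_t tid[2];
	struct targ t[2];
	int i;

	if (numa_available() < 0)
		exit(1);

	for (i = 0; i < 2; i++) {
		/* Both chunks are initially allocated on node 0 */
		t[i].chunk = numa_alloc_onnode(CHUNK_SIZE, 0);
		t[i].node = i;
		pthread_create(&tid[i], NULL, access_loop, &t[i]);
	}

	/*
	 * The benchmark score is the wall-clock time taken for both threads
	 * to complete all of their accesses (timing omitted here for brevity).
	 */
	for (i = 0; i < 2; i++)
		pthread_join(tid[i], NULL);

	for (i = 0; i < 2; i++)
		numa_free(t[i].chunk, CHUNK_SIZE);

	return 0;
}

Something like the above builds with "gcc -O2 bench.c -lnuma -lpthread".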
A couple of additional points:

- One difference between the default case and the IBS case is when the faults accumulated since the last scan are folded into the historical fault statistics and the scan period is subsequently updated. Since there is no notion of scanning in the IBS case, I use a threshold on the number of access faults to decide when to update the historical fault stats and the IBS sample period. I need to check whether the quicker migrations could be a result of this change.

- Finally, all of this is for the above-mentioned microbenchmark. The gains on other benchmarks are yet to be evaluated.

Regards,
Bharata.