Bharata B Rao <bharata@xxxxxxx> writes:

> On 27-Feb-23 1:24 PM, Huang, Ying wrote:
>> Thank you very much for the detailed data.  Can you provide some
>> analysis for your data?
>
> The overhead numbers I shared earlier weren't correct, as I realized
> that while obtaining those numbers from function_graph tracing, the
> trace buffer was silently getting overrun. I had to reduce the number
> of memory access iterations to ensure that I capture the full trace
> buffer. I will summarize the findings based on these new numbers
> below.
>
> Just to recap - the microbenchmark is run on an AMD Genoa two-node
> system. The benchmark has two sets of threads (one set affined to
> each node) accessing two different chunks of memory (chunk size 8G)
> which are initially allocated on the first node. The benchmark
> touches each page in its chunk iteratively for a fixed number of
> iterations (384 in the case shown below). The benchmark score is the
> amount of time it takes to complete the specified number of accesses.
>
> Here is the data for the benchmark run:
>
> Time taken or overhead (us) for fault, task_work and sched_switch
> handling
>
>                          Default         IBS
> Fault handling           2875354862      2602455
> Task work handling       139023          24008121
> Sched switch handling    -               37712
> Total overhead           2875493885      26648288
>
> Default
> -------
>                          Total        Min     Max        Avg
> do_numa_page             2875354862   0.08    392.13     22.11
> task_numa_work           139023       0.14    5365.77    532.66
> Total                    2875493885
>
> IBS
> ---
>                          Total        Min     Max        Avg
> ibs_overflow_handler     2602455      0.14    103.91     1.29
> task_ibs_access_work     24008121     0.17    485.09     37.65
> hw_access_sched_in       37712        0.15    287.55     1.35
> Total                    26648288
>
>
>                                  Default         IBS
> Benchmark score(us)              160171762.0     40323293.0
> numa_pages_migrated              2097220         511791
> Overhead per page (us)           1371            52
> Pages migrated per sec           13094           12692
> numa_hint_faults_local           2820311         140856
> numa_hint_faults                 38589520        652647

For the default case, numa_hint_faults >> numa_pages_migrated, which is
hard to understand. I guess there aren't many shared pages in the
benchmark? And I guess the free pages in the target node are sufficient
too?

> hint_faults_local/hint_faults    7%              22%
>
> Here is the summary:
>
> - In the case of IBS, the benchmark completes 75% faster compared to
>   the default case. The gain varies based on how many iterations of
>   memory accesses we run as part of the benchmark. For 2048 iterations
>   of accesses, I have seen a gain of around 50%.
> - The overhead of NUMA balancing (as measured by the time taken in
>   fault handling, task_work handling and sched_switch handling) in the
>   default case is seen to be quite high compared to the IBS case.
> - The number of hint faults in the default case is significantly
>   higher than in the IBS case.
> - The local hint-fault percentage is much better in the IBS case
>   compared to the default case.
> - As shown in the graphs (in other threads of this mail thread), in
>   the default case the page migrations start a bit slowly, while the
>   IBS case shows steady migrations right from the start.
> - I have also shown (via graphs in other threads of this mail thread)
>   that in the IBS case the benchmark is able to steadily increase the
>   access iterations over time, while in the default case the benchmark
>   doesn't make forward progress for a long time after an initial
>   increase.

This is hard to understand too. Pages are migrated to the local node,
but performance doesn't improve.

> - Early migrations, driven by relevant access sampling from IBS, are
>   most probably the main reason for the uplift that the IBS case gets.
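To make sure we are talking about the same access pattern, below is a
minimal sketch of how I read the recap above.  It uses one thread per
node and libnuma for the initial placement and the affinity; the chunk
size and iteration count are taken from the mail, while everything else
(thread counts, initial-placement method, timing, error handling) is my
simplification, not the actual benchmark code:

/* build with: gcc -O2 -pthread bench.c -lnuma */
#include <numa.h>
#include <pthread.h>

#define CHUNK_SIZE	(8UL << 30)	/* 8G per chunk, as in the recap */
#define NR_ITERATIONS	384		/* accesses per page, as in the recap */
#define PAGE_SZ		4096UL

struct worker {
	unsigned char *chunk;		/* chunk this thread iterates over */
	int node;			/* node this thread is affined to */
};

/* Touch one byte in every page of the chunk, NR_ITERATIONS times. */
static void *touch_pages(void *arg)
{
	struct worker *w = arg;
	unsigned long off, iter;

	numa_run_on_node(w->node);

	for (iter = 0; iter < NR_ITERATIONS; iter++)
		for (off = 0; off < CHUNK_SIZE; off += PAGE_SZ)
			w->chunk[off]++;

	return NULL;
}

int main(void)
{
	/* Both chunks start out on node 0, as described in the recap. */
	struct worker w[2] = {
		{ numa_alloc_onnode(CHUNK_SIZE, 0), 0 },
		{ numa_alloc_onnode(CHUNK_SIZE, 0), 1 },
	};
	pthread_t t[2];
	int i;

	for (i = 0; i < 2; i++)
		pthread_create(&t[i], NULL, touch_pages, &w[i]);
	for (i = 0; i < 2; i++)
		pthread_join(t[i], NULL);

	numa_free(w[0].chunk, CHUNK_SIZE);
	numa_free(w[1].chunk, CHUNK_SIZE);
	return 0;
}

The placement via numa_alloc_onnode() is only one way to get both
chunks onto node 0 initially; the real benchmark may rely on first
touch or an explicit mbind() instead, which would not change the access
pattern itself.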
In the original kernel, the NUMA page table scanning is delayed for a
while. Please check the comment below in task_tick_numa().

	/*
	 * Using runtime rather than walltime has the dual advantage that
	 * we (mostly) drive the selection from busy threads and that the
	 * task needs to have done some actual work before we bother with
	 * NUMA placement.
	 */

I think this is generally reasonable, although it's not ideal for this
micro-benchmark. (A rough sketch of the check this comment sits in is
appended after the quoted text at the end of this mail.)

Best Regards,
Huang, Ying

> - It is consistently seen that the benchmark in the IBS case manages
>   to complete the specified number of accesses even before the entire
>   chunk of memory gets migrated. The early migrations offset the cost
>   of the remote accesses too.
> - In the IBS case, we re-program the IBS counters for the incoming
>   task in the sched_switch path. This overhead is seen to be too small
>   to slow down the benchmark.
> - One of the differences between the default case and the IBS case is
>   when the faults-since-last-scan count is folded into the historical
>   fault stats and the scan period is subsequently updated. Since we
>   don't have the notion of scanning in IBS, I use a threshold (number
>   of access faults) to determine when to update the historical faults
>   and the IBS sample period. I need to check whether quicker
>   migrations could result from this change.
> - Finally, all of this is for the above-mentioned microbenchmark. The
>   gains on other benchmarks are yet to be evaluated.
>
> Regards,
> Bharata.
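For reference, here is a rough, simplified paraphrase of the
runtime-based check that the quoted comment sits in (task_tick_numa()
in kernel/sched/fair.c).  The field and helper names follow the
upstream code, but the details differ between kernel versions, so
please treat this as a sketch rather than the literal source:

	/*
	 * Sketch of the scan-delay check in task_tick_numa(): the next
	 * scan is scheduled against the task's accumulated CPU runtime,
	 * so task_numa_work() is only queued once the task has actually
	 * executed for numa_scan_period worth of CPU time, not merely
	 * after that much wallclock time has passed.
	 */
	now = curr->se.sum_exec_runtime;
	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;

	if (now > curr->node_stamp + period) {
		if (!curr->node_stamp)
			curr->numa_scan_period = task_scan_start(curr);
		curr->node_stamp += period;

		/* queue task_numa_work() to run on return to user space */
		task_work_add(curr, work, TWA_RESUME);
	}

For a benchmark like the one above, this means the first scan, and
hence the first hint faults and migrations, can lag the start of the
run, which would fit the slow start visible in the default-case graphs.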