On 1/16/2025 10:27 AM, Shivank Garg wrote:
> On 1/11/2025 1:21 AM, Zi Yan wrote:
> <snip>
>
>
>>>> BTW, I notice that you called dmaengine_get_dma_device() in folios_copy_dma(),
>>>> which would incur a huge overhead, based on my past experience using a DMA engine
>>>> for page copy. I know it is needed to make sure DMA is still present, but
>>>> its cost needs to be minimized to make DMA folio copy usable. Otherwise,
>>>> the 768MB/s DMA copy throughput from your cover letter cannot convince people
>>>> to use it for page migration, since a single CPU can achieve more than that,
>>>> as you showed in the table below.
>
> Thank you for pointing this out.
> I'm learning about the DMAEngine and will look more into the DMA driver part.
>
>>>>> Using init_on_alloc=0 gave a significant performance gain over the last experiment,
>>>>> but I'm still missing the performance scaling you observed.
>>>>
>>>> It might be the difference between x86 and ARM64, but I am not 100% sure.
>>>> Based on your data below, 2 or 4 threads seem to be the sweet spot for
>>>> the multi-threaded method on AMD CPUs. BTW, what is the bandwidth between
>>>> two sockets in your system? From Figure 10 in [1], I see the InfiniBand
>>>> between two AMD EPYC 7601 @ 2.2GHz was measured at ~12GB/s unidirectional,
>>>> ~25GB/s bidirectional. I wonder if your results below are cross-socket
>>>> link bandwidth limited.
>
> I tested the cross-socket bandwidth on my EPYC Zen 5 system and easily get over
> 10x that bandwidth. I don't think bandwidth is an issue here.
>
>
>>>>
>>>> From my results, the NVIDIA Grace CPU can achieve high copy throughput
>>>> with more threads between two sockets; maybe part of the reason is that
>>>> its cross-socket link theoretical bandwidth is 900GB/s bidirectional.
>>>
>>> I talked to my colleague about this and he mentioned the CCD architecture
>>> on AMD CPUs. IIUC, one or two cores from one CCD can already saturate
>>> the CCD's outgoing bandwidth, and all CPUs are enumerated from one CCD to
>>> another. This means my naive scheduling algorithm, which assigns CPUs 0 to N
>>> for N threads, uses all cores from one CCD first, then moves to another
>>> CCD. It is not able to saturate the cross-socket bandwidth. Does it make
>>> sense to you?
>>>
>>> If yes, can you please change my CPU selection code in mm/copy_pages.c:
>
> This makes sense.
>
> I first tried distributing the worker threads across different CCDs, which yielded
> better results.
>
> Also, I switched my system to the NPS-2 config (2 nodes per socket). This was done
> to eliminate cross-socket connections and variables by focusing on intra-socket
> page migrations.
>
> Cross-Socket (Node 0 -> Node 2)
> THP Always (2MB pages), GB/s
>
> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
> 262144    12.37    12.52  15.72  24.94  30.40  33.23  34.68  29.67
> 524288    12.44    12.19  15.70  24.96  32.72  33.40  35.40  29.18
>
> Intra-Socket (Node 0 -> Node 1)
> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
> 262144    12.37    17.10  18.65  26.05  35.56  37.80  33.73  29.29
> 524288    12.44    16.73  18.87  24.34  35.63  37.49  33.79  29.76
>
> I have temporarily hardcoded the CPU assignments and will work on improving the
> CPU selection code.
>>>
>>> +	/* TODO: need a better cpu selection method */
>>> +	for_each_cpu(cpu, per_node_cpumask) {
>>> +		if (i >= total_mt_num)
>>> +			break;
>>> +		cpu_id_list[i] = cpu;
>>> +		++i;
>>> +	}
>>>
>>> to select CPUs from as many CCDs as possible and rerun the tests.
>>> That might boost the page migration throughput on AMD CPUs more.
>>>
>>> Thanks.
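
For illustration, distributing the workers across CCDs could look roughly like the
sketch below. This is only a rough, untested sketch, not the code behind the numbers
above: it assumes the CPUs in per_node_cpumask are enumerated CCD by CCD (as you
described), and it takes a cpus_per_ccd value as a parameter, whereas a real version
would derive that from the cache/LLC topology.

#include <linux/cpumask.h>

/*
 * Illustrative only: pick worker CPUs by striding through the node's
 * cpumask so that consecutive workers land on different CCDs, instead
 * of taking the first total_mt_num CPUs.
 */
static int spread_cpus_across_ccds(const struct cpumask *per_node_cpumask,
				   int cpu_id_list[], int total_mt_num,
				   unsigned int cpus_per_ccd)
{
	unsigned int offset, i = 0;
	int cpu;

	for (offset = 0; offset < cpus_per_ccd && i < total_mt_num; offset++) {
		unsigned int idx = 0;

		/* pass 0 takes CPU 0 of each CCD, pass 1 takes CPU 1, ... */
		for_each_cpu(cpu, per_node_cpumask) {
			if (i >= total_mt_num)
				break;
			if (idx++ % cpus_per_ccd == offset)
				cpu_id_list[i++] = cpu;
		}
	}

	return i;	/* number of CPUs actually selected */
}
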
>>>
>>>>>
>>>>> THP Never (GB/s)
>>>>> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
>>>>> 512       1.40     1.43   2.79   3.48   3.63   3.73   3.63   3.57
>>>>> 4096      2.54     3.32   3.18   4.65   4.83   5.11   5.39   5.78
>>>>> 8192      3.35     4.40   4.39   4.71   3.63   5.04   5.33   6.00
>>>>> 16348     3.76     4.50   4.44   5.33   5.41   5.41   6.47   6.41
>>>>>
>>>>> THP Always (GB/s)
>>>>> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
>>>>> 512       5.21     5.47   5.77   6.92   3.71   2.75   7.54   7.44
>>>>> 1024      6.10     7.65   8.12   8.41   8.87   8.55   9.13   11.36
>>>>> 2048      6.39     6.66   9.58   8.92   10.75  12.99  13.33  12.23
>>>>> 4096      7.33     10.85  8.22   13.57  11.43  10.93  12.53  16.86
>>>>> 8192      7.26     7.46   8.88   11.82  10.55  10.94  13.27  14.11
>>>>> 16348     9.07     8.53   11.82  14.89  12.97  13.22  16.14  18.10
>>>>> 32768     10.45    10.55  11.79  19.19  16.85  17.56  20.58  26.57
>>>>> 65536     11.00    11.12  13.25  18.27  16.18  16.11  19.61  27.73
>>>>> 262144    12.37    12.40  15.65  20.00  19.25  19.38  22.60  31.95
>>>>> 524288    12.44    12.33  15.66  19.78  19.06  18.96  23.31  32.29
>>>>
>>>> [1] https://www.dell.com/support/kbdoc/en-us/000143393/amd-epyc-stream-hpl-infiniband-and-wrf-performance-study
>>
>>
>> BTW, I reran the experiments on a two-socket Xeon E5-2650 v4 @ 2.20GHz system with the pull method.
>> The 4KB case is not very impressive, at most 60% more throughput, but 2MB can get ~6.5x the
>> vanilla kernel throughput using 8 or 16 threads.
>>
>>
>> 4KB (GB/s)
>>
>> | -------- | ------- | ---- | ---- | ---- | ---- | ----- |
>> | nr_pages | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
>> | -------- | ------- | ---- | ---- | ---- | ---- | ----- |
>> | 512      | 1.12    | 1.19 | 1.20 | 1.26 | 1.27 | 1.35  |
>> | 768      | 1.29    | 1.14 | 1.28 | 1.40 | 1.39 | 1.46  |
>> | 1024     | 1.19    | 1.25 | 1.34 | 1.51 | 1.52 | 1.53  |
>> | 2048     | 1.14    | 1.12 | 1.44 | 1.61 | 1.73 | 1.71  |
>> | 4096     | 1.09    | 1.14 | 1.46 | 1.64 | 1.81 | 1.78  |
>>
>>
>>
>> 2MB (GB/s)
>> | -------- | ------- | ---- | ---- | ----- | ----- | ----- |
>> | nr_pages | vanilla | mt_1 | mt_2 | mt_4  | mt_8  | mt_16 |
>> | -------- | ------- | ---- | ---- | ----- | ----- | ----- |
>> | 1        | 2.03    | 2.21 | 2.69 | 2.93  | 3.17  | 3.14  |
>> | 2        | 2.28    | 2.13 | 3.54 | 4.50  | 4.72  | 4.72  |
>> | 4        | 2.92    | 2.93 | 4.44 | 6.50  | 7.24  | 7.06  |
>> | 8        | 2.29    | 2.37 | 3.21 | 6.86  | 8.83  | 8.44  |
>> | 16       | 2.10    | 2.09 | 4.57 | 8.06  | 8.32  | 9.70  |
>> | 32       | 2.22    | 2.21 | 4.43 | 8.96  | 9.37  | 11.54 |
>> | 64       | 2.35    | 2.35 | 3.15 | 7.77  | 10.77 | 13.61 |
>> | 128      | 2.48    | 2.53 | 5.12 | 8.18  | 11.01 | 15.62 |
>> | 256      | 2.55    | 2.53 | 5.44 | 8.25  | 12.73 | 16.49 |
>> | 512      | 2.61    | 2.52 | 5.73 | 11.26 | 17.18 | 16.97 |
>> | 768      | 2.55    | 2.53 | 5.90 | 11.41 | 14.86 | 17.15 |
>> | 1024     | 2.56    | 2.52 | 5.99 | 11.46 | 16.77 | 17.25 |
>>
>
> I see, thank you for checking.
>
> Meanwhile, I'll continue to explore performance optimization
> avenues.

Hi Zi,

I experimented with your testcase[1] and got a 2-2.5X throughput gain over my
previous experiment. The multi-threading scaling for 32 threads is ~4X (slightly
higher than in my previous experiment).

The main difference between our move_pages() throughput benchmarks is that you
allocate THPs explicitly with aligned_alloc() and MADV_HUGEPAGE, whereas I was
relying on system THP and operating on 4KB boundaries. Although both methods use
THP and I expected similar performance, the throughput was lower in my case
because, with my test code, the kernel processes all 512 4KB pages within a 2MB
region in the first migration attempt; for the subsequent pages,
__add_folio_for_migration() returns early with folio_nid(folio) == node, as they
are already on the target node. This adds the extra overhead of vma_lookup() and
folio_walk_start() for every such page in my experiment.
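
To make the difference concrete, here is a minimal user-space sketch. It is not
your testcase or my exact benchmark: the buffer size, the target node, and the
missing error handling are simplifications for illustration only.

#define _GNU_SOURCE
#include <numaif.h>	/* move_pages(), MPOL_MF_MOVE; link with -lnuma */
#include <stdlib.h>
#include <sys/mman.h>

#define PAGE_SZ	4096UL
#define THP_SZ	(2UL * 1024 * 1024)
#define NR_THP	64	/* arbitrary buffer size for the sketch */

int main(void)
{
	/* Explicit THP, as in the testcase[1]: 2MB-aligned buffer + MADV_HUGEPAGE */
	char *buf = aligned_alloc(THP_SZ, NR_THP * THP_SZ);
	void *pages[NR_THP];
	int nodes[NR_THP], status[NR_THP];

	madvise(buf, NR_THP * THP_SZ, MADV_HUGEPAGE);
	for (unsigned long off = 0; off < NR_THP * THP_SZ; off += PAGE_SZ)
		buf[off] = 1;	/* fault in, hopefully as 2MB folios */

	/* One move_pages() entry per 2MB folio: each entry does real migration work */
	for (unsigned long i = 0; i < NR_THP; i++) {
		pages[i] = buf + i * THP_SZ;
		nodes[i] = 1;	/* illustrative target node */
	}
	move_pages(0, NR_THP, pages, nodes, status, MPOL_MF_MOVE);

	/*
	 * My original test passed one entry per 4KB page instead. The first
	 * entry of each 2MB region migrates the whole large folio; the other
	 * 511 entries only pay the vma_lookup()/folio_walk_start() cost and
	 * return early because the folio is already on the target node.
	 */
	return 0;
}
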
2MB pages (GB/s):

nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
1         10.74    11.04  4.68   8.17   6.47   6.09   3.97   6.20
2         12.44    4.90   11.19  14.10  15.33  8.45   10.09  9.97
4         14.82    9.80   11.93  18.35  21.82  17.09  10.53  7.51
8         16.13    9.91   15.26  11.85  26.53  13.09  12.71  13.75
16        15.99    8.81   13.84  22.43  33.89  11.91  12.30  13.26
32        14.03    11.37  17.54  23.96  57.07  18.78  19.51  21.29
64        15.79    9.55   22.19  33.17  57.18  65.51  55.39  62.53
128       18.22    16.65  21.49  30.73  52.99  61.05  58.44  60.38
256       19.78    20.56  24.72  34.94  56.73  71.11  61.83  62.77
512       20.27    21.40  27.47  39.23  65.72  67.97  70.48  71.39
1024      20.48    21.48  27.48  38.30  68.62  77.94  78.00  78.95

[1]: https://github.com/x-y-z/thp-migration-bench/blob/arm64/move_thp.c

>
> Best Regards,
> Shivank
>>
>> Best Regards,
>> Yan, Zi
>>
>