On 1/11/2025 1:21 AM, Zi Yan wrote:

<snip>

>>> BTW, I notice that you called dmaengine_get_dma_device() in folios_copy_dma(),
>>> which would incur a huge overhead, based on my past experience using DMA engine
>>> for page copy. I know it is needed to make sure DMA is still present, but
>>> its cost needs to be minimized to make DMA folio copy usable. Otherwise,
>>> the 768MB/s DMA copy throughput from your cover letter cannot convince people
>>> to use it for page migration, since single CPU can achieve more than that,
>>> as you showed in the table below.

Thank you for pointing this out. I'm learning about the DMAEngine and will
look more into the DMA driver part.

>>>> Using init_on_alloc=0 gave significant performance gain over the last experiment
>>>> but I'm still missing the performance scaling you observed.
>>>
>>> It might be the difference between x86 and ARM64, but I am not 100% sure.
>>> Based on your data below, 2 or 4 threads seem to the sweep spot for
>>> the multi-threaded method on AMD CPUs. BTW, what is the bandwidth between
>>> two sockets in your system? From Figure 10 in [1], I see the InfiniteBand
>>> between two AMD EPYC 7601 @ 2.2GHz was measured at ~12GB/s unidirectional,
>>> ~25GB/s bidirectional. I wonder if your results below are cross-socket
>>> link bandwidth limited.

I tested the cross-socket bandwidth on my EPYC Zen 5 system and easily get
more than 10X that bandwidth. I don't think bandwidth is an issue here.

>>>
>>> From my results, NVIDIA Grace CPU can achieve high copy throughput
>>> with more threads between two sockets, maybe part of the reason is that
>>> its cross-socket link theoretical bandwidth is 900GB/s bidirectional.
>>
>> I talked to my colleague about this and he mentioned about CCD architecture
>> on AMD CPUs. IIUC, one or two cores from one CCD can already saturate
>> the CCD's outgoing bandwidth and all CPUs are enumerated from one CCD to
>> another.
>> This means my naive scheduling algorithm, which use CPUs from
>> 0 to N threads, uses all cores from one CDD first, then move to another
>> CCD. It is not able to saturate the cross-socket bandwidth. Does it make
>> sense to you?
>>
>> If yes, can you please change the my cpu selection code in mm/copy_pages.c:

This makes sense. I first tried distributing the worker threads across
different CCDs, which yielded better results.

I also switched my system to the NPS-2 config (2 NUMA nodes per socket) to
eliminate cross-socket links as a variable and focus on intra-socket page
migrations.

Throughput (GB/s), THP Always (2MB pages):

Cross-Socket (Node 0 -> Node 2)
nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
262144    12.37    12.52  15.72  24.94  30.40  33.23  34.68  29.67
524288    12.44    12.19  15.70  24.96  32.72  33.40  35.40  29.18

Intra-Socket (Node 0 -> Node 1)
nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
262144    12.37    17.10  18.65  26.05  35.56  37.80  33.73  29.29
524288    12.44    16.73  18.87  24.34  35.63  37.49  33.79  29.76

I have temporarily hardcoded the CPU assignments and will work on improving
the CPU selection code.

>>
>> +	/* TODO: need a better cpu selection method */
>> +	for_each_cpu(cpu, per_node_cpumask) {
>> +		if (i >= total_mt_num)
>> +			break;
>> +		cpu_id_list[i] = cpu;
>> +		++i;
>> +	}
>>
>> to select CPUs from as many CCDs as possible and rerun the tests.
>> That might boost the page migration throughput on AMD CPUs more.
>>
>> Thanks.
>>
>>>>
>>>> THP Never
>>>> nr_pages  vanilla  mt:0  mt:1  mt:2  mt:4  mt:8  mt:16  mt:32
>>>> 512       1.40     1.43  2.79  3.48  3.63  3.73  3.63   3.57
>>>> 4096      2.54     3.32  3.18  4.65  4.83  5.11  5.39   5.78
>>>> 8192      3.35     4.40  4.39  4.71  3.63  5.04  5.33   6.00
>>>> 16348     3.76     4.50  4.44  5.33  5.41  5.41  6.47   6.41
>>>>
>>>> THP Always
>>>> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
>>>> 512       5.21     5.47   5.77   6.92   3.71   2.75   7.54   7.44
>>>> 1024      6.10     7.65   8.12   8.41   8.87   8.55   9.13   11.36
>>>> 2048      6.39     6.66   9.58   8.92   10.75  12.99  13.33  12.23
>>>> 4096      7.33     10.85  8.22   13.57  11.43  10.93  12.53  16.86
>>>> 8192      7.26     7.46   8.88   11.82  10.55  10.94  13.27  14.11
>>>> 16348     9.07     8.53   11.82  14.89  12.97  13.22  16.14  18.10
>>>> 32768     10.45    10.55  11.79  19.19  16.85  17.56  20.58  26.57
>>>> 65536     11.00    11.12  13.25  18.27  16.18  16.11  19.61  27.73
>>>> 262144    12.37    12.40  15.65  20.00  19.25  19.38  22.60  31.95
>>>> 524288    12.44    12.33  15.66  19.78  19.06  18.96  23.31  32.29
>>>
>>> [1] https://www.dell.com/support/kbdoc/en-us/000143393/amd-epyc-stream-hpl-infiniband-and-wrf-performance-study
>
>
> BTW, I rerun the experiments on a two socket Xeon E5-2650 v4 @ 2.20GHz system with pull method.
> The 4KB is not very impressive, at most 60% more throughput, but 2MB can get ~6.5x of
> vanilla kernel throughput using 8 or 16 threads.
>
>
> 4KB (GB/s)
>
> | ---- | ------- | ---- | ---- | ---- | ---- | ----- |
> |      | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
> | ---- | ------- | ---- | ---- | ---- | ---- | ----- |
> | 512  | 1.12    | 1.19 | 1.20 | 1.26 | 1.27 | 1.35  |
> | 768  | 1.29    | 1.14 | 1.28 | 1.40 | 1.39 | 1.46  |
> | 1024 | 1.19    | 1.25 | 1.34 | 1.51 | 1.52 | 1.53  |
> | 2048 | 1.14    | 1.12 | 1.44 | 1.61 | 1.73 | 1.71  |
> | 4096 | 1.09    | 1.14 | 1.46 | 1.64 | 1.81 | 1.78  |
>
>
>
> 2MB (GB/s)
> | ---- | ------- | ---- | ---- | ----- | ----- | ----- |
> |      | vanilla | mt_1 | mt_2 | mt_4  | mt_8  | mt_16 |
> | ---- | ------- | ---- | ---- | ----- | ----- | ----- |
> | 1    | 2.03    | 2.21 | 2.69 | 2.93  | 3.17  | 3.14  |
> | 2    | 2.28    | 2.13 | 3.54 | 4.50  | 4.72  | 4.72  |
> | 4    | 2.92    | 2.93 | 4.44 | 6.50  | 7.24  | 7.06  |
> | 8    | 2.29    | 2.37 | 3.21 | 6.86  | 8.83  | 8.44  |
> | 16   | 2.10    | 2.09 | 4.57 | 8.06  | 8.32  | 9.70  |
> | 32   | 2.22    | 2.21 | 4.43 | 8.96  | 9.37  | 11.54 |
> | 64   | 2.35    | 2.35 | 3.15 | 7.77  | 10.77 | 13.61 |
> | 128  | 2.48    | 2.53 | 5.12 | 8.18  | 11.01 | 15.62 |
> | 256  | 2.55    | 2.53 | 5.44 | 8.25  | 12.73 | 16.49 |
> | 512  | 2.61    | 2.52 | 5.73 | 11.26 | 17.18 | 16.97 |
> | 768  | 2.55    | 2.53 | 5.90 | 11.41 | 14.86 | 17.15 |
> | 1024 | 2.56    | 2.52 | 5.99 | 11.46 | 16.77 | 17.25 |
>

I see, thank you for checking. Meanwhile, I'll continue to explore
performance optimization avenues.

Best Regards,
Shivank

>
>
> Best Regards,
> Yan, Zi
>