On 10 Jan 2025, at 12:05, Zi Yan wrote:

> <snip>
>>>
>>>>> main() {
>>>>>         ...
>>>>>
>>>>>         // code snippet to measure throughput
>>>>>         clock_gettime(CLOCK_MONOTONIC, &t1);
>>>>>         retcode = move_pages(getpid(), num_pages, pages, nodesArray, statusArray, MPOL_MF_MOVE);
>>>>>         clock_gettime(CLOCK_MONOTONIC, &t2);
>>>>>
>>>>>         // tput = num_pages*PAGE_SIZE/(t2-t1)
>>>>>
>>>>>         ...
>>>>> }
>>>>>
>>>>>
>>>>> Measurements:
>>>>> ============
>>>>> vanilla: base kernel without the patchset
>>>>> mt:0 = MT kernel with use_mt_copy=0
>>>>> mt:1..mt:32 = MT kernel with use_mt_copy=1 and thread cnt = 1,2,...,32
>>>>>
>>>>> Measured for both configurations, push_0_pull_1=0 and push_0_pull_1=1, and
>>>>> for both 4KB migration and THP migration.
>>>>>
>>>>> --------------------
>>>>> #1 push_0_pull_1 = 0 (src node CPUs are used)
>>>>>
>>>>> #1.1 THP=Never, 4KB (GB/s):
>>>>> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
>>>>> 512       1.28     1.28   1.92   1.80   2.24   2.35   2.22   2.17
>>>>> 4096      2.40     2.40   2.51   2.58   2.83   2.72   2.99   3.25
>>>>> 8192      3.18     2.88   2.83   2.69   3.49   3.46   3.57   3.80
>>>>> 16348     3.17     2.94   2.96   3.17   3.63   3.68   4.06   4.15
>>>>>
>>>>> #1.2 THP=Always, 2MB (GB/s):
>>>>> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
>>>>> 512       4.31     5.02   3.39   3.40   3.33   3.51   3.91   4.03
>>>>> 1024      7.13     4.49   3.58   3.56   3.91   3.87   4.39   4.57
>>>>> 2048      5.26     6.47   3.91   4.00   3.71   3.85   4.97   6.83
>>>>> 4096      9.93     7.77   4.58   3.79   3.93   3.53   6.41   4.77
>>>>> 8192      6.47     6.33   4.37   4.67   4.52   4.39   5.30   5.37
>>>>> 16348     7.66     8.00   5.20   5.22   5.24   5.28   6.41   7.02
>>>>> 32768     8.56     8.62   6.34   6.20   6.20   6.19   7.18   8.10
>>>>> 65536     9.41     9.40   7.14   7.15   7.15   7.19   7.96   8.89
>>>>> 262144    10.17    10.19  7.26   7.90   7.98   8.05   9.46   10.30
>>>>> 524288    10.40    9.95   7.25   7.93   8.02   8.76   9.55   10.30
>>>>>
>>>>> --------------------
>>>>> #2 push_0_pull_1 = 1 (dst node CPUs are used)
>>>>>
>>>>> #2.1 THP=Never, 4KB (GB/s):
>>>>> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
>>>>> 512       1.28     1.36   2.01   2.74   2.33   2.31   2.53   2.96
>>>>> 4096      2.40     2.84   2.94   3.04   3.40   3.23   3.31   4.16
>>>>> 8192      3.18     3.27   3.34   3.94   3.77   3.68   4.23   4.76
>>>>> 16348     3.17     3.42   3.66   3.21   3.82   4.40   4.76   4.89
>>>>>
>>>>> #2.2 THP=Always, 2MB (GB/s):
>>>>> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
>>>>> 512       4.31     5.91   4.03   3.73   4.26   4.13   4.78   3.44
>>>>> 1024      7.13     6.83   4.60   5.13   5.03   5.19   5.94   7.25
>>>>> 2048      5.26     7.09   5.20   5.69   5.83   5.73   6.85   8.13
>>>>> 4096      9.93     9.31   4.90   4.82   4.82   5.26   8.46   8.52
>>>>> 8192      6.47     7.63   5.66   5.85   5.75   6.14   7.45   8.63
>>>>> 16348     7.66     10.00  6.35   6.54   6.66   6.99   8.18   10.21
>>>>> 32768     8.56     9.78   7.06   7.41   7.76   9.02   9.55   11.92
>>>>> 65536     9.41     10.00  8.19   9.20   9.32   8.68   11.00  13.31
>>>>> 262144    10.17    11.17  9.01   9.96   9.99   10.00  11.70  14.27
>>>>> 524288    10.40    11.38  9.07   9.98   10.01  10.09  11.95  14.48
>>>>>
>>>>> Notes:
>>>>> 1. For THP = Never: I'm using 16X as many pages to keep the total size the
>>>>>    same as in your experiment with the 64KB page size.
>>>>> 2. For THP = Always: nr_pages = number of 4KB pages moved
>>>>>    (nr_pages=512 => 512 4KB pages => one 2MB page).
>>>>>
>>>>> I'm seeing little (1.5X in some cases) to no benefit. The performance scaling is
>>>>> relatively flat across thread counts.
>>>>>
>>>>> Is it possible I'm missing something in my testing?
>>>>>
>>>>> Could the base page size difference (4KB vs 64KB) be playing a role in
>>>>> the scaling behavior? How does the performance vary with 4KB pages on your system?
>>>>>
>>>>> I'd be happy to work with you on investigating these differences.
>>>>> Let me know if you'd like any additional test data or if there are specific
>>>>> configurations I should try.
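
As an aside, for anyone who wants to reproduce the numbers above: the
measurement loop quoted at the top can be fleshed out into a standalone test
roughly as below. This is only a sketch of how I read it; the node pair 0 -> 1,
the libnuma allocation, and plain 4KB anonymous pages are assumptions on my
part, not necessarily what the original harness does. Build with
"gcc -O2 -o migrate_bench migrate_bench.c -lnuma".

/*
 * migrate_bench.c: rough standalone version of the measurement loop
 * quoted above. Source node 0, destination node 1 and the libnuma
 * allocation are assumptions, not taken from the original harness.
 */
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	long num_pages = argc > 1 ? atol(argv[1]) : 512;
	long page_size = sysconf(_SC_PAGESIZE);
	int src_node = 0, dst_node = 1;		/* assumed node pair */
	struct timespec t1, t2;
	long i, retcode;

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}

	/* Allocate the test region on the source node and fault it in. */
	char *buf = numa_alloc_onnode(num_pages * page_size, src_node);
	if (!buf) {
		perror("numa_alloc_onnode");
		return 1;
	}
	memset(buf, 1, num_pages * page_size);

	void **pages = calloc(num_pages, sizeof(void *));
	int *nodesArray = calloc(num_pages, sizeof(int));
	int *statusArray = calloc(num_pages, sizeof(int));
	for (i = 0; i < num_pages; i++) {
		pages[i] = buf + i * page_size;
		nodesArray[i] = dst_node;
	}

	/* Same measurement window as in the quoted snippet. */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	retcode = move_pages(getpid(), num_pages, pages, nodesArray,
			     statusArray, MPOL_MF_MOVE);
	clock_gettime(CLOCK_MONOTONIC, &t2);
	if (retcode < 0) {
		perror("move_pages");
		return 1;
	}

	/* tput = num_pages * PAGE_SIZE / (t2 - t1) */
	double secs = (t2.tv_sec - t1.tv_sec) +
		      (t2.tv_nsec - t1.tv_nsec) / 1e9;
	printf("moved %ld pages in %.6f s: %.2f GB/s\n",
	       num_pages, secs, num_pages * page_size / secs / 1e9);

	numa_free(buf, num_pages * page_size);
	return 0;
}

For the THP=Always runs the buffer would additionally need to be 2MB-aligned
so that it can actually be backed by THPs; the sketch only covers the 4KB case.
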
>>>> The results surprise me, since I was able to achieve ~9GB/s when migrating
>>>> 16 2MB THPs with 16 threads on a two socket system with Xeon E5-2650 v3 @ 2.30GHz
>>>> (a 19.2GB/s bandwidth QPI link between two sockets) back in 2019[1].
>>>> These are 10-year-old Haswell CPUs, and your results above show that EPYC 5 can
>>>> only achieve ~4GB/s when migrating 512 2MB THPs with 16 threads. It just does
>>>> not make sense.
>>>>
>>>> One thing you might want to try is to set init_on_alloc=0 in your boot
>>>> parameters to use folio_zero_user() instead of GFP_ZERO to zero pages. That
>>>> might reduce the time spent on page zeroing.
>>>>
>>>> I am also going to rerun the experiments locally on x86_64 boxes to see if your
>>>> results can be replicated.
>>>>
>>>> Thank you for the review and for running these experiments. I really appreciate
>>>> it.
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@xxxxxxxx/
>>>>
>>>
>>> Using init_on_alloc=0 gave a significant performance gain over the last
>>> experiment, but I'm still missing the performance scaling you observed.
>>
>> It might be the difference between x86 and ARM64, but I am not 100% sure.
>> Based on your data below, 2 or 4 threads seem to be the sweet spot for
>> the multi-threaded method on AMD CPUs. BTW, what is the bandwidth between
>> two sockets in your system? From Figure 10 in [1], I see the InfiniBand
>> between two AMD EPYC 7601 @ 2.2GHz was measured at ~12GB/s unidirectional,
>> ~25GB/s bidirectional. I wonder if your results below are cross-socket
>> link bandwidth limited.
>>
>> From my results, the NVIDIA Grace CPU can achieve high copy throughput
>> with more threads between two sockets; maybe part of the reason is that
>> its cross-socket link has a theoretical bandwidth of 900GB/s bidirectional.
>
> I talked to my colleague about this and he mentioned the CCD architecture
> of AMD CPUs. IIUC, one or two cores from one CCD can already saturate
> the CCD's outgoing bandwidth, and all CPUs are enumerated from one CCD to
> another. This means my naive scheduling algorithm, which uses CPUs 0
> through N, uses all cores from one CCD first, then moves to another CCD.
> It is not able to saturate the cross-socket bandwidth. Does that make
> sense to you?
>
> If yes, can you please change my CPU selection code in mm/copy_pages.c:
>
> +	/* TODO: need a better cpu selection method */
> +	for_each_cpu(cpu, per_node_cpumask) {
> +		if (i >= total_mt_num)
> +			break;
> +		cpu_id_list[i] = cpu;
> +		++i;
> +	}
>
> to select CPUs from as many CCDs as possible and rerun the tests?
> That might boost the page migration throughput on AMD CPUs more.
>
> Thanks.
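
To be concrete about what I have in mind, something along the lines of the
sketch below could replace the loop quoted above. It is untested and relies
on the assumption that the CPUs in per_node_cpumask are enumerated one CCD
after another, so striding through the mask spreads the workers across CCDs
instead of packing them onto the first one:

	/*
	 * Untested sketch: pick every stride-th CPU of the node instead of
	 * the first total_mt_num CPUs, assuming CPUs are enumerated CCD by
	 * CCD so that the selected CPUs land on different CCDs.
	 */
	unsigned int nr_node_cpus = cpumask_weight(per_node_cpumask);
	unsigned int stride = max(1U, nr_node_cpus / total_mt_num);
	unsigned int skip = 0;

	i = 0;
	for_each_cpu(cpu, per_node_cpumask) {
		if (i >= total_mt_num)
			break;
		if (skip) {
			skip--;
			continue;
		}
		cpu_id_list[i++] = cpu;
		skip = stride - 1;
	}

A more robust variant would group CPUs by shared L3 (each CCD has its own L3
cache) rather than rely on enumeration order, but the stride version should be
enough to tell whether spreading across CCDs recovers the cross-socket
bandwidth.
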
>
>>>
>>> THP Never
>>> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
>>> 512       1.40     1.43   2.79   3.48   3.63   3.73   3.63   3.57
>>> 4096      2.54     3.32   3.18   4.65   4.83   5.11   5.39   5.78
>>> 8192      3.35     4.40   4.39   4.71   3.63   5.04   5.33   6.00
>>> 16348     3.76     4.50   4.44   5.33   5.41   5.41   6.47   6.41
>>>
>>> THP Always
>>> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
>>> 512       5.21     5.47   5.77   6.92   3.71   2.75   7.54   7.44
>>> 1024      6.10     7.65   8.12   8.41   8.87   8.55   9.13   11.36
>>> 2048      6.39     6.66   9.58   8.92   10.75  12.99  13.33  12.23
>>> 4096      7.33     10.85  8.22   13.57  11.43  10.93  12.53  16.86
>>> 8192      7.26     7.46   8.88   11.82  10.55  10.94  13.27  14.11
>>> 16348     9.07     8.53   11.82  14.89  12.97  13.22  16.14  18.10
>>> 32768     10.45    10.55  11.79  19.19  16.85  17.56  20.58  26.57
>>> 65536     11.00    11.12  13.25  18.27  16.18  16.11  19.61  27.73
>>> 262144    12.37    12.40  15.65  20.00  19.25  19.38  22.60  31.95
>>> 524288    12.44    12.33  15.66  19.78  19.06  18.96  23.31  32.29
>>
>> [1] https://www.dell.com/support/kbdoc/en-us/000143393/amd-epyc-stream-hpl-infiniband-and-wrf-performance-study

BTW, I reran the experiments on a two-socket Xeon E5-2650 v4 @ 2.20GHz system
with the pull method. The 4KB numbers are not very impressive, at most 60% more
throughput, but 2MB can get ~6.5x of the vanilla kernel throughput using 8 or
16 threads.

4KB (GB/s)

| ---- | ------- | ---- | ---- | ---- | ---- | ----- |
|      | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
| ---- | ------- | ---- | ---- | ---- | ---- | ----- |
| 512  | 1.12    | 1.19 | 1.20 | 1.26 | 1.27 | 1.35  |
| 768  | 1.29    | 1.14 | 1.28 | 1.40 | 1.39 | 1.46  |
| 1024 | 1.19    | 1.25 | 1.34 | 1.51 | 1.52 | 1.53  |
| 2048 | 1.14    | 1.12 | 1.44 | 1.61 | 1.73 | 1.71  |
| 4096 | 1.09    | 1.14 | 1.46 | 1.64 | 1.81 | 1.78  |

2MB (GB/s)

| ---- | ------- | ---- | ---- | ----- | ----- | ----- |
|      | vanilla | mt_1 | mt_2 | mt_4  | mt_8  | mt_16 |
| ---- | ------- | ---- | ---- | ----- | ----- | ----- |
| 1    | 2.03    | 2.21 | 2.69 | 2.93  | 3.17  | 3.14  |
| 2    | 2.28    | 2.13 | 3.54 | 4.50  | 4.72  | 4.72  |
| 4    | 2.92    | 2.93 | 4.44 | 6.50  | 7.24  | 7.06  |
| 8    | 2.29    | 2.37 | 3.21 | 6.86  | 8.83  | 8.44  |
| 16   | 2.10    | 2.09 | 4.57 | 8.06  | 8.32  | 9.70  |
| 32   | 2.22    | 2.21 | 4.43 | 8.96  | 9.37  | 11.54 |
| 64   | 2.35    | 2.35 | 3.15 | 7.77  | 10.77 | 13.61 |
| 128  | 2.48    | 2.53 | 5.12 | 8.18  | 11.01 | 15.62 |
| 256  | 2.55    | 2.53 | 5.44 | 8.25  | 12.73 | 16.49 |
| 512  | 2.61    | 2.52 | 5.73 | 11.26 | 17.18 | 16.97 |
| 768  | 2.55    | 2.53 | 5.90 | 11.41 | 14.86 | 17.15 |
| 1024 | 2.56    | 2.52 | 5.99 | 11.46 | 16.77 | 17.25 |

Best Regards,
Yan, Zi