On 9 Jan 2025, at 13:03, Shivank Garg wrote:

> On 1/9/2025 8:34 PM, Zi Yan wrote:
>> On 9 Jan 2025, at 6:47, Shivank Garg wrote:
>>
>>> On 1/3/2025 10:54 PM, Zi Yan wrote:
>>>
>>>> 6. A better interface than copy_page_lists_mt() to allow DMA data copy
>>>> to be used as well.
>>>
>>> I think Static Calls can be a better option for this.
>>
>> This is the first time I have heard about it. Based on the info I found, I agree
>> it is a great mechanism for switching between two methods globally.
>>>
>>> This will give a flexible copy interface to support both CPU and various DMA-based
>>> folio copies. A DMA-capable driver can override the default CPU copy path without any
>>> additional runtime overhead.
>>
>> Yes, supporting DMA-based folio copy is my intention too. I am happy to work with
>> you on that. Things to note are:
>> 1. The DMA engine should have more copy throughput than a single CPU thread; otherwise,
>> the scatter-gather setup overhead will eliminate the benefit of using the DMA engine.
>
> I agree on this.
>
>> 2. Unless the DMA engine is really beefy and can handle all possible page migration
>> requests, CPU-based migration (single or multiple threads) should be the fallback.
>>
>> Regarding 2, I wonder how much overhead Static Calls incur when switching
>> between functions. Also, a lock might be needed, since falling back to the CPU might
>> be per migrate_pages(). Considering these two points, Static Calls might not work
>> as you intended if switching between CPU and DMA is needed.
>
> You can check patches 4/5 and 5/5 for the static call implementation using the DMA driver:
> https://lore.kernel.org/linux-mm/20240614221525.19170-5-shivankg@xxxxxxx
>
> There is no run-time overhead in this static call approach, as the update happens only
> during DMA driver registration/un-registration - dma_update_migrator().
> The SRCU synchronization will ensure safety during updates.

I understand this part.

> It'll use static_call(_folios_copy)() for the copy path.
> A wrapper inside the DMA driver can
> ensure it falls back to folios_copy().
>
> Does this address your concern regarding 2?

The DMA driver will need to fall back to folios_copy() (using the CPU to copy folios)
when it thinks the DMA engine is overloaded. In my mind, the kernel should decide
whether to use a single CPU, multiple CPUs, or the DMA engine based on CPU usage and
DMA usage. As I write this, I realize that might be an overhead we want to avoid,
since it takes time to gather CPU usage and DMA usage information, and that should
not be on the critical path of page migration. A better approach might be for the
CPU scheduler and the DMA engine to call dma_update_migrator() to change _folios_copy
in the static call, based on CPU usage and DMA usage.

Yes, I think Static Calls should be able to help us choose the right folio copy
method, single CPU, multiple CPUs, or DMA engine, to perform folio copies.

BTW, I noticed that you call dmaengine_get_dma_device() in folios_copy_dma(), which
would incur a huge overhead, based on my past experience using a DMA engine for page
copy. I know it is needed to make sure the DMA device is still present, but its cost
needs to be minimized to make DMA folio copy usable. Otherwise, the 768MB/s DMA copy
throughput from your cover letter cannot convince people to use it for page migration,
since a single CPU can achieve more than that, as you showed in the table below.

>>> main() {
>>> ...
>>>
>>> // code snippet to measure throughput
>>> clock_gettime(CLOCK_MONOTONIC, &t1);
>>> retcode = move_pages(getpid(), num_pages, pages, nodesArray, statusArray, MPOL_MF_MOVE);
>>> clock_gettime(CLOCK_MONOTONIC, &t2);
>>>
>>> // tput = num_pages*PAGE_SIZE/(t2-t1)
>>>
>>> ...
>>> }
>>>
>>> Measurements:
>>> ============
>>> vanilla: base kernel without patchset
>>> mt:0 = MT kernel with use_mt_copy=0
>>> mt:1..mt:32 = MT kernel with use_mt_copy=1 and thread cnt = 1,2,...,32
>>>
>>> Measured for both configurations, push_0_pull_1=0 and push_0_pull_1=1, and
>>> for both 4KB migration and THP migration.
>>>
>>> --------------------
>>> #1 push_0_pull_1 = 0 (src node CPUs are used)
>>>
>>> #1.1 THP=Never, 4KB (GB/s):
>>> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
>>> 512       1.28     1.28   1.92   1.80   2.24   2.35   2.22   2.17
>>> 4096      2.40     2.40   2.51   2.58   2.83   2.72   2.99   3.25
>>> 8192      3.18     2.88   2.83   2.69   3.49   3.46   3.57   3.80
>>> 16348     3.17     2.94   2.96   3.17   3.63   3.68   4.06   4.15
>>>
>>> #1.2 THP=Always, 2MB (GB/s):
>>> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
>>> 512       4.31     5.02   3.39   3.40   3.33   3.51   3.91   4.03
>>> 1024      7.13     4.49   3.58   3.56   3.91   3.87   4.39   4.57
>>> 2048      5.26     6.47   3.91   4.00   3.71   3.85   4.97   6.83
>>> 4096      9.93     7.77   4.58   3.79   3.93   3.53   6.41   4.77
>>> 8192      6.47     6.33   4.37   4.67   4.52   4.39   5.30   5.37
>>> 16348     7.66     8.00   5.20   5.22   5.24   5.28   6.41   7.02
>>> 32768     8.56     8.62   6.34   6.20   6.20   6.19   7.18   8.10
>>> 65536     9.41     9.40   7.14   7.15   7.15   7.19   7.96   8.89
>>> 262144    10.17    10.19  7.26   7.90   7.98   8.05   9.46   10.30
>>> 524288    10.40    9.95   7.25   7.93   8.02   8.76   9.55   10.30
>>>
>>> --------------------
>>> #2 push_0_pull_1 = 1 (dst node CPUs are used):
>>>
>>> #2.1 THP=Never, 4KB (GB/s):
>>> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
>>> 512       1.28     1.36   2.01   2.74   2.33   2.31   2.53   2.96
>>> 4096      2.40     2.84   2.94   3.04   3.40   3.23   3.31   4.16
>>> 8192      3.18     3.27   3.34   3.94   3.77   3.68   4.23   4.76
>>> 16348     3.17     3.42   3.66   3.21   3.82   4.40   4.76   4.89
>>>
>>> #2.2 THP=Always, 2MB (GB/s):
>>> nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
>>> 512       4.31     5.91   4.03   3.73   4.26   4.13   4.78   3.44
>>> 1024      7.13     6.83   4.60   5.13   5.03   5.19   5.94   7.25
>>> 2048      5.26     7.09   5.20   5.69   5.83   5.73   6.85   8.13
>>> 4096      9.93     9.31   4.90   4.82   4.82   5.26   8.46   8.52
>>> 8192      6.47     7.63   5.66   5.85   5.75   6.14   7.45   8.63
>>> 16348     7.66     10.00  6.35   6.54   6.66   6.99   8.18   10.21
>>> 32768     8.56     9.78   7.06   7.41   7.76   9.02   9.55   11.92
>>> 65536     9.41     10.00  8.19   9.20   9.32   8.68   11.00  13.31
>>> 262144    10.17    11.17  9.01   9.96   9.99   10.00  11.70  14.27
>>> 524288    10.40    11.38  9.07   9.98   10.01  10.09  11.95  14.48
>>>
>>> Note:
>>> 1. For THP=Never: I'm using 16X the pages to keep the total size the same as in your
>>> experiment with 64KB page size.
>>> 2. For THP=Always: nr_pages = number of 4KB pages moved
>>> (nr_pages=512 => 512 4KB pages => 1 2MB page).
>>>
>>> I'm seeing little (1.5X in some cases) to no benefit. The performance scaling is
>>> relatively flat across thread counts.
>>>
>>> Is it possible I'm missing something in my testing?
>>>
>>> Could the base page size difference (4KB vs 64KB) be playing a role in
>>> the scaling behavior? How does the performance vary with 4KB pages on your system?
>>>
>>> I'd be happy to work with you on investigating these differences.
>>> Let me know if you'd like any additional test data or if there are specific
>>> configurations I should try.
>>
>> These results surprise me, since I was able to achieve ~9GB/s when migrating
>> 16 2MB THPs with 16 threads on a two-socket system with Xeon E5-2650 v3 @ 2.30GHz
>> (a 19.2GB/s bandwidth QPI link between the two sockets) back in 2019 [1].
>> Those are 10-year-old Haswell CPUs, and your results above show that EPYC 5 can
>> only achieve ~4GB/s when migrating 512 2MB THPs with 16 threads. It just does
>> not make sense.
>>
>> One thing you might want to try is setting init_on_alloc=0 in your boot
>> parameters to use folio_zero_user() instead of GFP_ZERO to zero pages. That
>> might reduce the time spent zeroing pages.
>>
>> I am also going to rerun the experiments locally on x86_64 boxes to see whether your
>> results can be replicated.
>>
>> Thank you for the review and running these experiments.
>> I really appreciate
>> it.
>>
>> [1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@xxxxxxxx/
>
> Using init_on_alloc=0 gave a significant performance gain over the last experiment,
> but I'm still missing the performance scaling you observed.

It might be the difference between x86 and ARM64, but I am not 100% sure. Based on
your data below, 2 or 4 threads seem to be the sweet spot for the multi-threaded
method on AMD CPUs.

BTW, what is the bandwidth between the two sockets in your system? From Figure 10
in [1], I see the cross-socket interconnect between two AMD EPYC 7601 @ 2.2GHz
CPUs was measured at ~12GB/s unidirectional and ~25GB/s bidirectional. I wonder
if your results below are limited by the cross-socket link bandwidth.