Zi Yan <ziy@xxxxxxxxxx> writes:

> On 21 Sep 2022, at 2:06, Huang Ying wrote:
>
>> From: "Huang, Ying" <ying.huang@xxxxxxxxx>
>>
>> Currently, migrate_pages() migrates pages one by one, as in the
>> pseudo-code below,
>>
>>   for each page
>>     unmap
>>     flush TLB
>>     copy
>>     restore map
>>
>> If multiple pages are passed to migrate_pages(), there are
>> opportunities to batch the TLB flushing and copying.  That is, we can
>> change the code to something like the following,
>>
>>   for each page
>>     unmap
>>   for each page
>>     flush TLB
>>   for each page
>>     copy
>>   for each page
>>     restore map
>>
>> The total number of TLB flushing IPIs can be reduced considerably.
>> And we may use some hardware accelerator such as DSA to accelerate
>> the page copying.
>>
>> So in this patch, we refactor the migrate_pages() implementation and
>> implement the TLB flushing batching.  Based on this, hardware
>> accelerated page copying can be implemented.
>>
>> If too many pages are passed to migrate_pages(), a naive batched
>> implementation may unmap too many pages at the same time.  The
>> possibility that a task has to wait for the migrated pages to be
>> mapped again increases, so the latency may suffer.  To deal with this
>> issue, the maximum number of pages unmapped in one batch is
>> restricted to no more than HPAGE_PMD_NR.  That is, the influence is
>> at the same level as THP migration.
>>
>> We use the following test to measure the performance impact of the
>> patchset,
>>
>> On a 2-socket Intel server,
>>
>> - Run the pmbench memory accessing benchmark
>>
>> - Run `migratepages` to migrate pages of pmbench between node 0 and
>>   node 1 back and forth.
>>
>> With the patch, the number of TLB flushing IPIs is reduced by 99.1%
>> during the test, and the number of pages migrated successfully per
>> second increases by 291.7%.
>
> Thank you for the patchset. Batching page migration will definitely
> improve its throughput from my past experiments[1] and starting with
> TLB flushing is a good first step.

Thanks for the pointer, the patch description provides valuable
information for me already!

> BTW, what is the rationale behind the increased page migration
> success rate per second?

From the perf profiling data, in the base kernel,

  migrate_pages.migrate_to_node.do_migrate_pages.kernel_migrate_pages.__x64_sys_migrate_pages: 2.87
  ptep_clear_flush.try_to_migrate_one.rmap_walk_anon.try_to_migrate.__unmap_and_move: 2.39

Because pmbench runs in the system too, the CPU cycles of
migrate_pages() are about 2.87%, while the CPU cycles for TLB flushing
are 2.39%.  That is, 2.39 / 2.87 = 83.3% of the CPU cycles of
migrate_pages() are used for TLB flushing.

After batching the TLB flushing, the perf profiling data becomes,

  migrate_pages.migrate_to_node.do_migrate_pages.kernel_migrate_pages.__x64_sys_migrate_pages: 2.77
  move_to_new_folio.migrate_pages_batch.migrate_pages.migrate_to_node.do_migrate_pages: 1.68
  copy_page.folio_copy.migrate_folio.move_to_new_folio.migrate_pages_batch: 1.21

Now 1.21 / 2.77 = 43.7% of the CPU cycles of migrate_pages() are used
for page copying.

  try_to_migrate_one: 0.23

The CPU cycles of unmapping and TLB flushing become 0.23 / 2.77 = 8.3%
of migrate_pages().

All in all, after the optimization we do much less TLB flushing, which
consumed a large share of the CPU cycles before the optimization.  So
the throughput of migrate_pages() increases greatly.

I will add these data in the next version of the patch.
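For reference, below is a rough sketch of the batched flow outlined in
the patch description above.  It is illustration only, not the actual
patch code: the migrate_folio_*_one() helpers are hypothetical
placeholders, while try_to_unmap_flush() is the existing helper that
issues the deferred, batched TLB flush.  The key point is that the TLB
flush in phase 2 runs once per batch rather than once per page, which
is where the IPI reduction comes from.

  /*
   * Sketch of batched migration: each phase walks the whole batch
   * before the next phase starts.
   */
  static void migrate_folios_batched_sketch(struct list_head *folios)
  {
  	struct folio *folio;

  	/* Phase 1: unmap all folios in the batch, deferring the TLB flush. */
  	list_for_each_entry(folio, folios, lru)
  		migrate_folio_unmap_one(folio);		/* hypothetical */

  	/* Phase 2: one batched TLB flush instead of one flush per folio. */
  	try_to_unmap_flush();

  	/* Phase 3: copy page contents; could later be offloaded to DSA. */
  	list_for_each_entry(folio, folios, lru)
  		migrate_folio_copy_one(folio);		/* hypothetical */

  	/* Phase 4: restore the mappings to the new folios. */
  	list_for_each_entry(folio, folios, lru)
  		migrate_folio_restore_one(folio);	/* hypothetical */
  }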
Best Regards,
Huang, Ying

>>
>> This patchset is based on v6.0-rc5 and the following patchset,
>>
>> [PATCH -V3 0/8] migrate_pages(): fix several bugs in error path
>> https://lore.kernel.org/lkml/20220817081408.513338-1-ying.huang@xxxxxxxxx/
>>
>> The migrate_pages() related code is being converted to folios now,
>> so this patchset cannot be applied to the recent akpm/mm-unstable
>> branch.  This patchset is meant to check the basic idea.  If it is
>> OK, I will rebase the patchset on top of the folio changes.
>>
>> Best Regards,
>> Huang, Ying
>
>
> [1] https://lwn.net/Articles/784925/
>
> --
> Best Regards,
> Yan, Zi