haoxin <xhao@xxxxxxxxxxxxxxxxx> writes:

> On 2022/9/28 10:01, Huang, Ying wrote:
>> haoxin <xhao@xxxxxxxxxxxxxxxxx> writes:
>>
>>> Hi, Huang
>>>
>>> On 2022/9/21 2:06, Huang Ying wrote:
>>>> From: "Huang, Ying" <ying.huang@xxxxxxxxx>
>>>>
>>>> Now, migrate_pages() migrates pages one by one, as in the following
>>>> pseudocode,
>>>>
>>>>   for each page
>>>>     unmap
>>>>     flush TLB
>>>>     copy
>>>>     restore map
>>>>
>>>> If multiple pages are passed to migrate_pages(), there are
>>>> opportunities to batch the TLB flushing and copying.  That is, we can
>>>> change the code to something as follows,
>>>>
>>>>   for each page
>>>>     unmap
>>>>   for each page
>>>>     flush TLB
>>>>   for each page
>>>>     copy
>>>>   for each page
>>>>     restore map
>>>>
>>>> The total number of TLB flushing IPIs can be reduced considerably.  And
>>>> we may use some hardware accelerator such as DSA to accelerate the
>>>> page copying.
>>>>
>>>> So in this patch, we refactor the migrate_pages() implementation and
>>>> implement the TLB flushing batching.  Based on this, hardware
>>>> accelerated page copying can be implemented.
>>>>
>>>> If too many pages are passed to migrate_pages(), in the naive batched
>>>> implementation, we may unmap too many pages at the same time.  The
>>>> possibility for a task to wait for the migrated pages to be mapped
>>>> again increases.  So the latency may be hurt.  To deal with this
>>>> issue, the max number of pages to be unmapped in a batch is restricted
>>>> to no more than HPAGE_PMD_NR.  That is, the influence is at the same
>>>> level as THP migration.
>>>>
>>>> We use the following test to measure the performance impact of the
>>>> patchset,
>>>>
>>>> On a 2-socket Intel server,
>>>>
>>>>  - Run pmbench memory accessing benchmark
>>>>
>>>>  - Run `migratepages` to migrate pages of pmbench between node 0 and
>>>>    node 1 back and forth.
>>>>
>>> As pmbench cannot run on arm64 machines, I use lmbench instead.
>>> My test case is as follows (I am not sure whether it is reasonable,
>>> but it seems to work):
>>>   ./bw_mem -N10000 10000m rd &
>>>   time migratepages pid node0 node1
>>>
>>>   w/o patch          w/ patch
>>>   real 0m0.035s      real 0m0.024s
>>>   user 0m0.000s      user 0m0.000s
>>>   sys  0m0.035s      sys  0m0.024s
>>>
>>> The migratepages time is reduced by about 32%.
>>>
>>> But there is a problem. I see that the batch flush is called via
>>>   migrate_pages_batch
>>>     try_to_unmap_flush
>>>       arch_tlbbatch_flush(&tlb_ubc->arch); // this is where the batch flush really happens.
>>>
>>> But on arm64, arch_tlbbatch_flush() is not supported, because arm64
>>> does not support CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH yet.
>>>
>>> So the TLB batch flush means no flush is done at all; it is an empty
>>> function.
>> Yes.  And should_defer_flush() will always return false too.  That is,
>> the TLB will still be flushed, but will not be batched.
> Oh, yes, I overlooked this, thank you.
>>
>>> Maybe this patch can help solve this problem.
>>> https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyicong@xxxxxxxxxx/T/
>> Yes.  This will bring TLB flush batching to ARM64.
> Next time, I will combine it with this patch and do some tests again.
> Do you have any suggestions about the benchmark?

I think your benchmark should be OK.  If multiple threads are used, the
effect of the patchset will be better.

Best Regards,
Huang, Ying