On Thu, Nov 09, 2023 at 01:20:29PM +0800, Huang, Ying wrote:
> Byungchul Park <byungchul@xxxxxx> writes:
>
> > Hi everyone,
> >
> > While I'm working with CXL memory, I have been facing migration
> > overhead esp. TLB shootdown on promotion or demotion between
> > different tiers. Yeah.. most TLB shootdowns on migration through
> > hinting fault can be avoided thanks to Huang Ying's work, commit
> > 4d4b6d66db ("mm,unmap: avoid flushing TLB in batch if PTE is
> > inaccessible").
> >
> > However, it's only for ones using hinting fault. I thought it'd be
> > much better if we have a general mechanism to reduce # of TLB
> > flushes and TLB misses, that we can apply to any type of migration.
> > I tried it only for tiering migration for now tho.
> >
> > I'm suggesting a mechanism to reduce TLB flushes by keeping source
> > and destination of folios participated in the migrations until all
> > TLB flushes required are done, only if those folios are not mapped
> > with write permission PTE entries at all. I worked based on
> > v6.6-rc5.
> >
> > Can you believe it? I saw the number of TLB full flush reduced about
> > 80% and iTLB miss reduced about 50%, and the time wise performance
> > always shows at least 1% stable improvement with the workload I
> > tested with, XSBench. However, I believe that it would help more
> > with other ones or any real ones. It'd be appreciated to let me know
> > if I'm missing something.
>
> Can you help to test the effect of commit 7e12beb8ca2a ("migrate_pages:
> batch flushing TLB") for your test case? To test it, you can revert it
> and compare the performance before and after the reverting.
>
> And, how do you trigger migration when testing XSBench? Use a tiered
> memory system, and migrate pages between DRAM and CXL memory back and
> forth? If so, how many pages will you migrate for each migration?

It was not actual CXL memory but a CPU-less remote NUMA node's DRAM
that the kernel recognizes as a slow tier (node_is_toptier() == false).
That has been fine for my purposes because I've been focusing on the
number of TLB flushes and migrations while working on the NUMA tiering
mechanism; I expect the time-wise performance to follow, by a larger or
smaller margin depending on the system configuration.

So it migrates pages between the two DRAMs back and forth: promotion by
hinting fault and demotion by page reclaim. I tested what you asked on
another, slower system to make the TLB miss overhead stand out.
Unfortunately, I got an even worse result with vanilla v6.6-rc5 than
with v6.6-rc5 plus 7e12beb8ca2a reverted, while 'v6.6-rc5 + migrc'
definitely shows a far better result. Thoughts?

For reference, a rough sketch of the "no writable PTE" check the
approach relies on is appended after the numbers below.

	Byungchul

---

Architecture - x86_64
QEMU         - kvm enabled, host cpu
NUMA         - 2 nodes (16 CPUs 1GB, no CPUs 8GB)
Kernel       - v6.6-rc5, NUMA_BALANCING_MEMORY_TIERING, demotion enabled
Benchmark    - XSBench -p 50000000 (-p option makes the runtime longer)

CASE1 - mainline v6.6-rc5 + 7e12beb8ca2a reverted
-------------------------------------------------
$ perf stat -a \
	-e itlb.itlb_flush \
	-e tlb_flush.dtlb_thread \
	-e tlb_flush.stlb_any \
	-e dTLB-load-misses \
	-e dTLB-store-misses \
	-e iTLB-load-misses \
	./XSBench -p 50000000

 Performance counter stats for 'system wide':

       190247118      itlb.itlb_flush
       716182438      tlb_flush.dtlb_thread
       327051673      tlb_flush.stlb_any
    119542331968      dTLB-load-misses
       724072795      dTLB-store-misses
      3054343419      iTLB-load-misses

  1172.580552728 seconds time elapsed

$ cat /proc/vmstat
...
numa_pages_migrated 5968431
pgmigrate_success 12484773
nr_tlb_remote_flush 6614459
nr_tlb_remote_flush_received 96022799
nr_tlb_local_flush_all 50869
nr_tlb_local_flush_one 785597
...

CASE2 - mainline v6.6-rc5 (vanilla)
-------------------------------------------------
$ perf stat -a \
	-e itlb.itlb_flush \
	-e tlb_flush.dtlb_thread \
	-e tlb_flush.stlb_any \
	-e dTLB-load-misses \
	-e dTLB-store-misses \
	-e iTLB-load-misses \
	./XSBench -p 50000000

 Performance counter stats for 'system wide':

        55139061      itlb.itlb_flush
       286725687      tlb_flush.dtlb_thread
       199687660      tlb_flush.stlb_any
    119497951269      dTLB-load-misses
       358434759      dTLB-store-misses
      1867135967      iTLB-load-misses

  1181.311084373 seconds time elapsed

$ cat /proc/vmstat
...
numa_pages_migrated 8190027
pgmigrate_success 17098994
nr_tlb_remote_flush 1955114
nr_tlb_remote_flush_received 29028093
nr_tlb_local_flush_all 140921
nr_tlb_local_flush_one 740767
...

CASE3 - mainline v6.6-rc5 + migrc
-------------------------------------------------
$ perf stat -a \
	-e itlb.itlb_flush \
	-e tlb_flush.dtlb_thread \
	-e tlb_flush.stlb_any \
	-e dTLB-load-misses \
	-e dTLB-store-misses \
	-e iTLB-load-misses \
	./XSBench -p 50000000

 Performance counter stats for 'system wide':

         6337091      itlb.itlb_flush
       157229778      tlb_flush.dtlb_thread
       148240163      tlb_flush.stlb_any
    117701381319      dTLB-load-misses
       231212468      dTLB-store-misses
       973083466      iTLB-load-misses

  1105.756705157 seconds time elapsed

$ cat /proc/vmstat
...
numa_pages_migrated 8791934
pgmigrate_success 18276174
nr_tlb_remote_flush 311146
nr_tlb_remote_flush_received 4387708
nr_tlb_local_flush_all 143883
nr_tlb_local_flush_one 740953
...
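
Appendix: the sketch mentioned above. It only illustrates the condition
from the quoted description, i.e. that a folio's TLB flush could be
deferred "only if those folios are not mapped with write permission PTE
entries at all". The helper name folio_has_writable_pte() and the way it
is wired into rmap_walk() are mine for illustration, assuming the
v6.6-era rmap API; this is not the actual migrc code, which also has to
batch the deferred flushes and perform them later.

#include <linux/mm.h>
#include <linux/rmap.h>

struct writable_check {
	bool has_writable;
};

/* rmap_one callback: look for any writable PTE mapping the folio. */
static bool check_writable_one(struct folio *folio,
			       struct vm_area_struct *vma,
			       unsigned long addr, void *arg)
{
	struct writable_check *wc = arg;
	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, addr, 0);

	while (page_vma_mapped_walk(&pvmw)) {
		/* Sketch only: PMD-mapped THPs (pvmw.pte == NULL) are skipped. */
		if (pvmw.pte && pte_write(ptep_get(pvmw.pte))) {
			wc->has_writable = true;
			page_vma_mapped_walk_done(&pvmw);
			return false;	/* writable mapping found: stop the walk */
		}
	}
	return true;			/* keep checking other VMAs */
}

/*
 * Hypothetical helper: true if any PTE mapping the folio is writable.
 * A folio with a writable mapping must be flushed right away on unmap;
 * only fully read-only folios would be candidates for keeping source
 * and destination around until the deferred flush is done. The caller
 * is expected to hold the folio lock, as rmap_walk() requires.
 */
static bool folio_has_writable_pte(struct folio *folio)
{
	struct writable_check wc = { .has_writable = false };
	struct rmap_walk_control rwc = {
		.rmap_one = check_writable_one,
		.arg = &wc,
	};

	rmap_walk(folio, &rwc);
	return wc.has_writable;
}

The posted migrc patches presumably gather this information during the
existing unmap/migrate rmap walk rather than doing a separate walk like
the one above; the standalone helper is only meant to make the gating
condition explicit.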