On Wed, May 22, 2024 at 03:38:04PM +0800, Huang, Ying wrote:
> Hi, Byungchul,
>
> Byungchul Park <byungchul@xxxxxx> writes:
>
> > On Mon, May 13, 2024 at 10:44:29AM +0900, Byungchul Park wrote:
> >> On Sat, May 11, 2024 at 03:15:01PM +0800, Huang, Ying wrote:
> >> > Byungchul Park <byungchul@xxxxxx> writes:
> >> >
> >> > > Hi everyone,
> >> > >
> >> > > While I'm working with a tiered memory system e.g. CXL memory, I have
> >> > > been facing migration overhead esp. tlb shootdown on promotion or
> >> > > demotion between different tiers. Yeah.. most tlb shootdowns on
> >> > > migration through hinting fault can be avoided thanks to Huang Ying's
> >> > > work, commit 4d4b6d66db ("mm,unmap: avoid flushing tlb in batch if PTE
> >> > > is inaccessible"). See the following link for more information:
> >> > >
> >> > > https://lore.kernel.org/lkml/20231115025755.GA29979@xxxxxxxxxxxxxxxxxxx/
> >> >
> >> > And, I still have interest of the performance impact of commit
> >> > 7e12beb8ca2a ("migrate_pages: batch flushing TLB"). In the email above,
> >> > you said that the performance of v6.5-rc5 + 7e12beb8ca2a reverted has
> >> > better performance than v6.5-rc5. Can you provide more details? For
> >> > example, the number of TLB flushing IPI for two kernels?
> >>
> >> Okay. I will test and share the result with what you asked me now once
> >> I get available for the test.
> >
> > I should admit that the test using qemu is so unstable. While using
> > qemu for the test, kernel with 7e12beb8ca2a applied gave better results
> > sometimes and worse ones sometimes. I should've used a bare metal from
> > the beginning. Sorry for making you confused with the unstable result.
> >
> > Since I thought you asked me for the test with the same environment in
> > the link above, I used qemu to reproduce the similar result but changed
> > the number of threads for the test from 16 to 14 to get rid of noise
> > that might be introduced by other than the intended test just in case.
> >
> > As expected, the stats are better with your work:
> >
> > ------------------------------------------
> > v6.6-rc5 with 7e12beb8ca2a commit reverted
> > ------------------------------------------
> >
> > 1) from output of XSBench
> >
> >    Threads: 14
> >    Runtime: 1127.043 seconds
> >    Lookups: 1,700,000,000
> >    Lookups/s: 1,508,371
> >
> > 2) from /proc/vmstat
> >
> >    numa_hit 15580171
> >    numa_miss 1034233
> >    numa_foreign 1034233
> >    numa_interleave 773
> >    numa_local 7927442
> >    numa_other 8686962
> >    numa_pte_updates 24068923
> >    numa_hint_faults 24061125
> >    numa_hint_faults_local 0
> >    numa_pages_migrated 7426480
> >    pgmigrate_success 15407375
> >    pgmigrate_fail 1849
> >    compact_migrate_scanned 4445414
> >    compact_daemon_migrate_scanned 4445414
> >    pgdemote_kswapd 7651061
> >    pgdemote_direct 0
> >    nr_tlb_remote_flush 8080092
> >    nr_tlb_remote_flush_received 109915713
> >    nr_tlb_local_flush_all 53800
> >    nr_tlb_local_flush_one 770466
> >
> > 3) from /proc/interrupts
> >
> >    TLB: 8022927 7840769 123588 7837008 7835967 7839837
> >    7838332 7839886 7837610 7837221 7834524 407260
> >    7430090 7835696 7839081 7712568 TLB shootdowns
> >
> > 4) from 'perf stat -a'
> >
> >       222371217      itlb.itlb_flush
> >       919832520      tlb_flush.dtlb_thread
> >       372223809      tlb_flush.stlb_any
> >    120210808042      dTLB-load-misses
> >       979352769      dTLB-store-misses
> >      3650767665      iTLB-load-misses
> >
> > -----------------------------------------
> > v6.6-rc5 with 7e12beb8ca2a commit applied
> > -----------------------------------------
> >
> > 1) from output of XSBench
> >
> >    Threads: 14
> >    Runtime: 1105.521 seconds
> >    Lookups: 1,700,000,000
> >    Lookups/s: 1,537,737
> >
> > 2) from /proc/vmstat
> >
> >    numa_hit 24148399
> >    numa_miss 797483
> >    numa_foreign 797483
> >    numa_interleave 772
> >    numa_local 12214575
> >    numa_other 12731307
> >    numa_pte_updates 24250278
> >    numa_hint_faults 24199756
> >    numa_hint_faults_local 0
> >    numa_pages_migrated 11476195
> >    pgmigrate_success 23634639
> >    pgmigrate_fail 1391
> >    compact_migrate_scanned 3760803
> >    compact_daemon_migrate_scanned 3760803
> >    pgdemote_kswapd 11932217
> >    pgdemote_direct 0
> >    nr_tlb_remote_flush 2151945
> >    nr_tlb_remote_flush_received 29672808
> >    nr_tlb_local_flush_all 124006
> >    nr_tlb_local_flush_one 741165
> >
> > 3) from /proc/interrupts
> >
> >    TLB: 2130784 2120142 2117571 844962 2071766 114675
> >    2117258 2119596 2116816 1205446 2119176 2119209
> >    2116792 2118763 2118773 2117762 TLB shootdowns
> >
> > 4) from 'perf stat -a'
> >
> >        60851902      itlb.itlb_flush
> >       334068491      tlb_flush.dtlb_thread
> >       223732916      tlb_flush.stlb_any
> >    120207083382      dTLB-load-misses
> >       446823059      dTLB-store-misses
> >      1926669373      iTLB-load-misses
> >
>
> Thanks a lot for test results!
>
> From your test results, the TLB shootdown IPI can be reduced effectively
> with commit 7e12beb8ca2a. So that the benchmark score improved a
> little.
>
> And, your changes will reduce the TLB shootdown IPI further, right? Do

Yes, right. LUF (Lazy Unmap Flush) reduces TLB shootdown IPI further.

> you have the number?

You can find the number obtained from llama.cpp in this cover letter:

https://lore.kernel.org/lkml/20240520021734.21527-1-byungchul@xxxxxx/

If you meant the number from the same test above, XSBench + qemu, I will
re-test with mm-unstable branch of mm tree and share the result shortly.

	Byungchul