Byungchul Park <byungchul@xxxxxx> writes:

> Hi everyone,
>
> While I'm working with a tiered memory system, e.g. CXL memory, I have
> been facing migration overhead, especially tlb shootdown on promotion
> or demotion between different tiers.  Yeah.. most tlb shootdowns on
> migration through hinting fault can be avoided thanks to Huang Ying's
> work, commit 4d4b6d66db ("mm,unmap: avoid flushing tlb in batch if PTE
> is inaccessible").  See the following link for more information:
>
> https://lore.kernel.org/lkml/20231115025755.GA29979@xxxxxxxxxxxxxxxxxxx/
>
> However, that only covers migration through hinting fault.  I thought
> it would be much better if we had a general mechanism that reduces the
> number of tlb flushes and can ultimately be applied to any type of
> migration.
>
> I'm suggesting a mechanism called MIGRC, which stands for 'Migration
> Read Copy', that reduces the number of tlb flushes by deferring them
> until the source folios of a migration actually get used again, and of
> course, only if the target PTEs don't have write permission.
>
> To achieve that:
>
>    1. For folios that map only to non-writable tlb entries, skip the
>       tlb flush during migration and perform it just before the source
>       folios are actually handed out again from buddy or pcp.
>
>    2. When any non-writable tlb entry changes to writable, e.g.
>       through the fault handler, give up on the migrc mechanism and
>       perform the required tlb flush right away.
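
To double check my understanding of the mechanism above, below is a
minimal sketch of the deferred flush in kernel-style C.  All of the
migrc_* helpers and folio_writable_mapped() are hypothetical names used
for illustration only, not the actual API of the patchset:

    #include <linux/mm.h>

    /* During migration unmap: try to defer the tlb shootdown. */
    static bool migrc_try_defer(struct folio *src)
    {
            /*
             * Deferring is only safe when every PTE mapping the
             * source folio is non-writable: a stale read-only tlb
             * entry can at worst serve reads of unchanged data.
             */
            if (folio_writable_mapped(src))
                    return false;           /* caller flushes as usual */

            migrc_mark_pending(src);        /* no IPIs sent here */
            return true;
    }

    /* When a source folio is handed out again from buddy or pcp. */
    static void migrc_flush_on_alloc(struct folio *src)
    {
            if (migrc_pending(src))
                    migrc_flush(src);       /* the deferred shootdown */
    }

    /* E.g. from the fault handler, when a PTE is made writable. */
    static void migrc_flush_on_mkwrite(struct folio *folio)
    {
            /* Writable mappings make stale entries harmful. */
            if (migrc_pending(folio))
                    migrc_flush(folio);
    }

If that is right, the invariant is that a deferred flush must complete
before the source folio can be reused with different contents and
before any stale mapping can become writable.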
> No matter what type of workload is used for performance evaluation,
> the result should be positive thanks to the unconditional reduction of
> tlb flushes, tlb misses and interrupts.  For the test, I picked
> XSBench, which is widely used for performance analysis on high
> performance computing architectures -
> https://github.com/ANL-CESAR/XSBench.
>
> The result depends on memory latency and on how often reclaim runs,
> which determine the tlb miss overhead and how many times migration
> happens.  The slower the memory and the more often reclaim runs, the
> better migrc works and the better the result.  In my system, the
> results show:
>
>    1. itlb flushes are reduced by over 90%.
>    2. itlb misses are reduced by over 30%.
>    3. All the other tlb numbers also improve.
>    4. tlb shootdown interrupts are reduced by over 90%.
>    5. The test program runtime is reduced by over 5%.
>
> The test environment:
>
>    Architecture - x86_64
>    QEMU - kvm enabled, host cpu

Is the test run in a VM?  Do you have test results in a bare metal
environment?

>    Numa - 2 nodes (16 CPUs 1GB, no CPUs 99GB)

The configuration looks quite abnormal.  Have you tested with other
configurations, such as 1:4 or 1:8?

>    Linux Kernel - v6.9-rc4, numa balancing tiering on, demotion
>                   enabled
>
> < measurement: raw data - tlb and interrupt numbers >
>
>    $ perf stat -a \
>            -e itlb.itlb_flush \
>            -e tlb_flush.dtlb_thread \
>            -e tlb_flush.stlb_any \
>            -e dtlb-load-misses \
>            -e dtlb-store-misses \
>            -e itlb-load-misses \
>            XSBench -t 16 -p 50000000
>
>    $ grep "TLB shootdowns" /proc/interrupts
>
> BEFORE
> ------
>    40417078       itlb.itlb_flush
>    234852566      tlb_flush.dtlb_thread
>    153192357      tlb_flush.stlb_any
>    119001107892   dTLB-load-misses
>    307921167      dTLB-store-misses
>    1355272118     iTLB-load-misses
>
>    TLB: 1364803   1303670   1333921   1349607
>         1356934   1354216   1332972   1342842
>         1350265   1316443   1355928   1360793
>         1298239   1326358   1343006   1340971   TLB shootdowns
>
> AFTER
> -----
>    3316495        itlb.itlb_flush
>    138912511      tlb_flush.dtlb_thread
>    115199341      tlb_flush.stlb_any
>    117610390021   dTLB-load-misses
>    198042233      dTLB-store-misses
>    840066984      iTLB-load-misses
>
>    TLB: 117257    119219    117178    115737
>         117967    118948    117508    116079
>         116962    117266    117320    117215
>         105808    103934    115672    117610    TLB shootdowns
>
> < measurement: user experience - runtime >
>
>    $ time XSBench -t 16 -p 50000000
>
> BEFORE
> ------
>    Threads:     16
>    Runtime:     968.783 seconds
>    Lookups:     1,700,000,000
>    Lookups/s:   1,754,778
>
>    15208.91s user 141.44s system 1564% cpu 16:20.98 total
>
> AFTER
> -----
>    Threads:     16
>    Runtime:     913.210 seconds
>    Lookups:     1,700,000,000
>    Lookups/s:   1,861,565
>
>    14351.69s user 138.23s system 1565% cpu 15:25.47 total

IIUC, the memory footprint will be larger with the patchset.  Do you
have data?

--
Best Regards,
Huang, Ying