Hi everybody, this series re-implements the LRU balancing between page cache and anonymous pages to work better with fast random IO swap devices. The LRU balancing code evolved under slow rotational disks with high seek overhead, and it had to extrapolate the cost of reclaiming a list based on in-memory reference patterns alone, which is error prone and, in combination with the high IO cost of mistakes, risky. As a result, the balancing code is now at a point where it mostly goes for page cache and avoids the random IO of swapping altogether until the VM is under significant memory pressure. With the proliferation of fast random IO devices such as SSDs and persistent memory, though, swap becomes interesting again, not just as a last-resort overflow, but as an extension of memory that can be used to optimize the in-memory balance between the page cache and the anonymous workingset even during moderate load. Our current reclaim choices don't exploit the potential of this hardware. This series sets out to address this. Having exact tracking of refault IO - the ultimate cost of reclaiming the wrong pages - allows us to use an IO cost based balancing model that is more aggressive about swapping on fast backing devices while holding back on existing setups that still use rotational storage. These patches base the LRU balancing on the rate of refaults on each list, times the relative IO cost between swap device and filesystem (swappiness), in order to optimize reclaim for least IO cost incurred. --- The following postgres benchmark demonstrates the benefits of this new model. The machine has 7G, the database is 5.6G with 1G for shared buffers, and the system has a little over 1G worth of anonymous pages from mostly idle processes and tmpfs files. The filesystem is on spinning rust, the swap partition is on an SSD; swappiness is set to 115 to ballpark the relative IO cost between them. The test run is preceded by 30 minutes of warmup using the same workload: transaction type: TPC-B (sort of) scaling factor: 420 query mode: simple number of clients: 8 number of threads: 4 duration: 3600 s vanilla: number of transactions actually processed: 290360 latency average: 99.187 ms latency stddev: 261.171 ms tps = 80.654848 (including connections establishing) tps = 80.654878 (excluding connections establishing) patched: number of transactions actually processed: 377960 latency average: 76.198 ms latency stddev: 229.411 ms tps = 104.987704 (including connections establishing) tps = 104.987743 (excluding connections establishing) The patched kernel shows a 30% increase in throughput, and a 23% decrease in average latency. Latency variance is reduced as well. The reclaim statistics explain the difference in behavior: PGBENCH5.6G-vanilla PGBENCH5.6G-lrucost Real time 3600.49 ( +0.00%) 3600.26 ( -0.01%) User time 17.85 ( +0.00%) 18.80 ( +5.05%) System time 17.52 ( +0.00%) 17.02 ( -2.72%) Allocation stalls 3.00 ( +0.00%) 0.00 ( -75.00%) Anon scanned 6579.00 ( +0.00%) 201845.00 (+2967.57%) Anon reclaimed 3426.00 ( +0.00%) 86924.00 (+2436.48%) Anon reclaim efficiency 52.07 ( +0.00%) 43.06 ( -16.98%) File scanned 364444.00 ( +0.00%) 27706.00 ( -92.40%) File reclaimed 363136.00 ( +0.00%) 27366.00 ( -92.46%) File reclaim efficiency 99.64 ( +0.00%) 98.77 ( -0.86%) Swap out 3149.00 ( +0.00%) 86932.00 (+2659.78%) Swap in 313.00 ( +0.00%) 503.00 ( +60.51%) File refault 222486.00 ( +0.00%) 101041.00 ( -54.59%) Total refaults 222799.00 ( +0.00%) 101544.00 ( -54.42%) The patched kernel works much harder to find idle anonymous pages in order to alleviate the thrashing of the page cache. And it pays off: overall, refault IO is cut in half, more time is spent in userspace, less time is spent in the kernel. --- The parallelio test from the mmtests package shows the backward compatibility of the new model. It runs a memcache workload while copying large files in parallel. The page cache isn't thrashing, so the VM shouldn't swap except to relieve immediate memory pressure. Swappiness is reset to the default setting of 60 as well. parallelio Transactions vanilla lrucost 60 60 Min memcachetest-0M 83736.00 ( 0.00%) 84376.00 ( 0.76%) Min memcachetest-769M 83708.00 ( 0.00%) 85038.00 ( 1.59%) Min memcachetest-2565M 85419.00 ( 0.00%) 85740.00 ( 0.38%) Min memcachetest-4361M 85979.00 ( 0.00%) 86746.00 ( 0.89%) Hmean memcachetest-0M 84805.85 ( 0.00%) 84852.31 ( 0.05%) Hmean memcachetest-769M 84273.56 ( 0.00%) 85160.52 ( 1.05%) Hmean memcachetest-2565M 85792.43 ( 0.00%) 85967.59 ( 0.20%) Hmean memcachetest-4361M 86212.90 ( 0.00%) 86891.87 ( 0.79%) Stddev memcachetest-0M 959.16 ( 0.00%) 339.07 ( 64.65%) Stddev memcachetest-769M 421.00 ( 0.00%) 110.07 ( 73.85%) Stddev memcachetest-2565M 277.86 ( 0.00%) 252.33 ( 9.19%) Stddev memcachetest-4361M 193.55 ( 0.00%) 106.30 ( 45.08%) CoeffVar memcachetest-0M 1.13 ( 0.00%) 0.40 ( 64.66%) CoeffVar memcachetest-769M 0.50 ( 0.00%) 0.13 ( 74.13%) CoeffVar memcachetest-2565M 0.32 ( 0.00%) 0.29 ( 9.37%) CoeffVar memcachetest-4361M 0.22 ( 0.00%) 0.12 ( 45.51%) Max memcachetest-0M 86067.00 ( 0.00%) 85129.00 ( -1.09%) Max memcachetest-769M 84715.00 ( 0.00%) 85305.00 ( 0.70%) Max memcachetest-2565M 86084.00 ( 0.00%) 86320.00 ( 0.27%) Max memcachetest-4361M 86453.00 ( 0.00%) 86996.00 ( 0.63%) parallelio Background IO vanilla lrucost 60 60 Min io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) Min io-duration-769M 6.00 ( 0.00%) 6.00 ( 0.00%) Min io-duration-2565M 21.00 ( 0.00%) 21.00 ( 0.00%) Min io-duration-4361M 36.00 ( 0.00%) 37.00 ( -2.78%) Amean io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) Amean io-duration-769M 6.67 ( 0.00%) 6.67 ( 0.00%) Amean io-duration-2565M 21.67 ( 0.00%) 21.67 ( 0.00%) Amean io-duration-4361M 36.33 ( 0.00%) 37.00 ( -1.83%) Stddev io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) Stddev io-duration-769M 0.47 ( 0.00%) 0.47 ( 0.00%) Stddev io-duration-2565M 0.47 ( 0.00%) 0.47 ( 0.00%) Stddev io-duration-4361M 0.47 ( 0.00%) 0.00 (100.00%) CoeffVar io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) CoeffVar io-duration-769M 7.07 ( 0.00%) 7.07 ( 0.00%) CoeffVar io-duration-2565M 2.18 ( 0.00%) 2.18 ( 0.00%) CoeffVar io-duration-4361M 1.30 ( 0.00%) 0.00 (100.00%) Max io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) Max io-duration-769M 7.00 ( 0.00%) 7.00 ( 0.00%) Max io-duration-2565M 22.00 ( 0.00%) 22.00 ( 0.00%) Max io-duration-4361M 37.00 ( 0.00%) 37.00 ( 0.00%) parallelio Swap totals vanilla lrucost 60 60 Min swapin-0M 244169.00 ( 0.00%) 281418.00 (-15.26%) Min swapin-769M 269973.00 ( 0.00%) 231669.00 ( 14.19%) Min swapin-2565M 204356.00 ( 0.00%) 188934.00 ( 7.55%) Min swapin-4361M 178044.00 ( 0.00%) 147799.00 ( 16.99%) Min swaptotal-0M 810441.00 ( 0.00%) 832580.00 ( -2.73%) Min swaptotal-769M 827282.00 ( 0.00%) 705879.00 ( 14.67%) Min swaptotal-2565M 690422.00 ( 0.00%) 656948.00 ( 4.85%) Min swaptotal-4361M 660507.00 ( 0.00%) 582026.00 ( 11.88%) Min minorfaults-0M 2677904.00 ( 0.00%) 2706086.00 ( -1.05%) Min minorfaults-769M 2731412.00 ( 0.00%) 2606587.00 ( 4.57%) Min minorfaults-2565M 2599647.00 ( 0.00%) 2572429.00 ( 1.05%) Min minorfaults-4361M 2573117.00 ( 0.00%) 2514047.00 ( 2.30%) Min majorfaults-0M 82864.00 ( 0.00%) 98005.00 (-18.27%) Min majorfaults-769M 95047.00 ( 0.00%) 78789.00 ( 17.11%) Min majorfaults-2565M 69486.00 ( 0.00%) 65934.00 ( 5.11%) Min majorfaults-4361M 60009.00 ( 0.00%) 50955.00 ( 15.09%) Amean swapin-0M 291429.67 ( 0.00%) 290184.67 ( 0.43%) Amean swapin-769M 294641.33 ( 0.00%) 247553.33 ( 15.98%) Amean swapin-2565M 224398.67 ( 0.00%) 199541.33 ( 11.08%) Amean swapin-4361M 188710.67 ( 0.00%) 155103.67 ( 17.81%) Amean swaptotal-0M 877847.33 ( 0.00%) 842476.33 ( 4.03%) Amean swaptotal-769M 860593.67 ( 0.00%) 765749.00 ( 11.02%) Amean swaptotal-2565M 724284.33 ( 0.00%) 674759.67 ( 6.84%) Amean swaptotal-4361M 669080.67 ( 0.00%) 594949.33 ( 11.08%) Amean minorfaults-0M 2743339.00 ( 0.00%) 2707815.33 ( 1.29%) Amean minorfaults-769M 2740174.33 ( 0.00%) 2656168.33 ( 3.07%) Amean minorfaults-2565M 2624234.00 ( 0.00%) 2579847.00 ( 1.69%) Amean minorfaults-4361M 2582434.67 ( 0.00%) 2525946.33 ( 2.19%) Amean majorfaults-0M 99845.67 ( 0.00%) 101007.33 ( -1.16%) Amean majorfaults-769M 101037.67 ( 0.00%) 87706.00 ( 13.19%) Amean majorfaults-2565M 74771.67 ( 0.00%) 68243.67 ( 8.73%) Amean majorfaults-4361M 62557.33 ( 0.00%) 52668.33 ( 15.81%) Stddev swapin-0M 33554.61 ( 0.00%) 6370.43 ( 81.01%) Stddev swapin-769M 18283.19 ( 0.00%) 11586.05 ( 36.63%) Stddev swapin-2565M 14314.16 ( 0.00%) 9023.96 ( 36.96%) Stddev swapin-4361M 11000.92 ( 0.00%) 6770.47 ( 38.46%) Stddev swaptotal-0M 47680.16 ( 0.00%) 8319.84 ( 82.55%) Stddev swaptotal-769M 23632.76 ( 0.00%) 42426.42 (-79.52%) Stddev swaptotal-2565M 24761.63 ( 0.00%) 14504.40 ( 41.42%) Stddev swaptotal-4361M 8173.20 ( 0.00%) 9177.32 (-12.29%) Stddev minorfaults-0M 49578.82 ( 0.00%) 1928.88 ( 96.11%) Stddev minorfaults-769M 7305.53 ( 0.00%) 35084.61 (-380.25%) Stddev minorfaults-2565M 17393.80 ( 0.00%) 5259.94 ( 69.76%) Stddev minorfaults-4361M 7780.48 ( 0.00%) 10048.60 (-29.15%) Stddev majorfaults-0M 12102.64 ( 0.00%) 2178.49 ( 82.00%) Stddev majorfaults-769M 4839.82 ( 0.00%) 6313.49 (-30.45%) Stddev majorfaults-2565M 3748.79 ( 0.00%) 2707.31 ( 27.78%) Stddev majorfaults-4361M 3292.87 ( 0.00%) 1466.92 ( 55.45%) CoeffVar swapin-0M 11.51 ( 0.00%) 2.20 ( 80.93%) CoeffVar swapin-769M 6.21 ( 0.00%) 4.68 ( 24.58%) CoeffVar swapin-2565M 6.38 ( 0.00%) 4.52 ( 29.10%) CoeffVar swapin-4361M 5.83 ( 0.00%) 4.37 ( 25.12%) CoeffVar swaptotal-0M 5.43 ( 0.00%) 0.99 ( 81.82%) CoeffVar swaptotal-769M 2.75 ( 0.00%) 5.54 (-101.76%) CoeffVar swaptotal-2565M 3.42 ( 0.00%) 2.15 ( 37.12%) CoeffVar swaptotal-4361M 1.22 ( 0.00%) 1.54 (-26.28%) CoeffVar minorfaults-0M 1.81 ( 0.00%) 0.07 ( 96.06%) CoeffVar minorfaults-769M 0.27 ( 0.00%) 1.32 (-395.44%) CoeffVar minorfaults-2565M 0.66 ( 0.00%) 0.20 ( 69.24%) CoeffVar minorfaults-4361M 0.30 ( 0.00%) 0.40 (-32.04%) CoeffVar majorfaults-0M 12.12 ( 0.00%) 2.16 ( 82.21%) CoeffVar majorfaults-769M 4.79 ( 0.00%) 7.20 (-50.28%) CoeffVar majorfaults-2565M 5.01 ( 0.00%) 3.97 ( 20.87%) CoeffVar majorfaults-4361M 5.26 ( 0.00%) 2.79 ( 47.09%) Max swapin-0M 318760.00 ( 0.00%) 296366.00 ( 7.03%) Max swapin-769M 313685.00 ( 0.00%) 258977.00 ( 17.44%) Max swapin-2565M 236882.00 ( 0.00%) 210990.00 ( 10.93%) Max swapin-4361M 203852.00 ( 0.00%) 164117.00 ( 19.49%) Max swaptotal-0M 913095.00 ( 0.00%) 852936.00 ( 6.59%) Max swaptotal-769M 879597.00 ( 0.00%) 799103.00 ( 9.15%) Max swaptotal-2565M 748943.00 ( 0.00%) 692476.00 ( 7.54%) Max swaptotal-4361M 680081.00 ( 0.00%) 602448.00 ( 11.42%) Max minorfaults-0M 2797869.00 ( 0.00%) 2710507.00 ( 3.12%) Max minorfaults-769M 2749296.00 ( 0.00%) 2682591.00 ( 2.43%) Max minorfaults-2565M 2637180.00 ( 0.00%) 2584036.00 ( 2.02%) Max minorfaults-4361M 2592162.00 ( 0.00%) 2538624.00 ( 2.07%) Max majorfaults-0M 110188.00 ( 0.00%) 103107.00 ( 6.43%) Max majorfaults-769M 106900.00 ( 0.00%) 92559.00 ( 13.42%) Max majorfaults-2565M 77770.00 ( 0.00%) 72043.00 ( 7.36%) Max majorfaults-4361M 67207.00 ( 0.00%) 54538.00 ( 18.85%) vanilla lrucost 60 60 User 1108.24 1122.37 System 4636.57 4650.63 Elapsed 6046.97 6047.82 vanilla lrucost 60 60 Minor Faults 34022711 33360104 Major Faults 1014895 929273 Swap Ins 2997968 2677588 Swap Outs 6397877 5956707 Allocation stalls 27 31 DMA allocs 0 0 DMA32 allocs 15080196 14356136 Normal allocs 26177871 26662120 Movable allocs 0 0 Direct pages scanned 31625 27194 Kswapd pages scanned 33103442 27727713 Kswapd pages reclaimed 11817394 11598677 Direct pages reclaimed 21146 24043 Kswapd efficiency 35% 41% Kswapd velocity 5474.385 4584.745 Direct efficiency 66% 88% Direct velocity 5.230 4.496 Percentage direct scans 0% 0% Zone normal velocity 3786.073 3908.266 Zone dma32 velocity 1693.542 680.975 Zone dma velocity 0.000 0.000 Page writes by reclaim 6398557.000 5962129.000 Page writes file 680 5422 Page writes anon 6397877 5956707 Page reclaim immediate 3750 12647 Sector Reads 12608512 11624860 Sector Writes 49304260 47539216 Page rescued immediate 0 0 Slabs scanned 148322 164263 Direct inode steals 0 0 Kswapd inode steals 0 22 Kswapd skipped wait 0 0 THP fault alloc 6 3 THP collapse alloc 3490 3567 THP splits 0 0 THP fault fallback 0 0 THP collapse fail 13 17 Compaction stalls 431 446 Compaction success 405 416 Compaction failures 26 30 Page migrate success 199708 211181 Page migrate failure 71 121 Compaction pages isolated 425244 452352 Compaction migrate scanned 209471 226018 Compaction free scanned 20950979 23257076 Compaction cost 216 229 NUMA alloc hit 38459351 38177612 NUMA alloc miss 0 0 NUMA interleave hit 0 0 NUMA alloc local 38455861 38174045 NUMA base PTE updates 0 0 NUMA huge PMD updates 0 0 NUMA page range updates 0 0 NUMA hint faults 0 0 NUMA hint local faults 0 0 NUMA hint local percent 100 100 NUMA pages migrated 0 0 AutoNUMA cost 0% 0% Both the memcache transactions and the background IO throughput are unchanged. Overall reclaim activity actually went down in the patched kernel, since the VM is now deterred by the swapins, whereas previously a successful swapout followed by a swapin would actually make the anon LRU more attractive (swapout is a scanned but not rotated page; swapin puts pages on the inactive list, which used to be a scan event too). The changes are fairly straight-forward, but they do require a page flag to tell inactive cache refaults (cache transition) from active ones (existing cache needs more space). On x86-32 PAE, that bumps us to 22 core flags + 7 section bits on x86 PAE + 2 zone bits = 31 bits. With the configurable hwpoison flag 32, and thus the last page flag. However, this is core VM functionality, and we can make new features 64-bit-only, like we did with the page idle tracking. Thanks Documentation/sysctl/vm.txt | 16 +++-- fs/cifs/file.c | 10 +-- fs/fuse/dev.c | 2 +- include/linux/mmzone.h | 29 ++++---- include/linux/page-flags.h | 2 + include/linux/pagevec.h | 2 +- include/linux/swap.h | 11 ++- include/trace/events/mmflags.h | 1 + kernel/sysctl.c | 3 +- mm/filemap.c | 9 +-- mm/migrate.c | 4 ++ mm/mlock.c | 2 +- mm/shmem.c | 4 +- mm/swap.c | 124 +++++++++++++++++++--------------- mm/swap_state.c | 3 +- mm/vmscan.c | 48 ++++++------- mm/vmstat.c | 6 +- mm/workingset.c | 142 +++++++++++++++++++++++++++++---------- 18 files changed, 258 insertions(+), 160 deletions(-) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>