On Wed, Dec 11, 2013 at 11:24:31AM +0100, Vlastimil Babka wrote:
> Changelog since V1 (thanks to the reviewers!)
>  o Included "trace compaction begin and end" patch in the series (mgorman)
>  o Changed variable names and comments in patches 2 and 5 (mgorman)
>  o More thorough measurements, based on v3.13-rc2
> 
> The broad goal of the series is to improve allocation success rates for huge
> pages through memory compaction, while trying not to increase the compaction
> overhead. The original objective was to reintroduce capturing of high-order
> pages freed by the compaction, before they are split by concurrent activity.
> However, several bugs and opportunities for simple improvements were found in
> the current implementation, mostly through extra tracepoints (which are
> however too ugly for now to be considered for sending).
> 
> The patches mostly deal with two mechanisms that reduce compaction overhead:
> caching the progress of the migrate and free scanners, and marking pageblocks
> where isolation failed so that they are skipped during further scans.
> 
> Patch 1 (from mgorman) adds tracepoints that allow calculating the time spent
> in compaction and potentially debugging scanner pfn values.
> 
> Patch 2 encapsulates some functionality for handling deferred compaction
> for better maintainability, without a functional change.
> 
> Patch 3 fixes a bug where cached scanner pfn's are sometimes reset only after
> they have been read to initialize a compaction run.
> 
> Patch 4 fixes a bug where scanners meeting is sometimes not properly detected
> and can lead to multiple compaction attempts quitting early without doing any
> work.
> 
> Patch 5 improves the chances of sync compaction to process pageblocks that
> async compaction has skipped due to being !MIGRATE_MOVABLE.
> 
> Patch 6 improves the chances of sync direct compaction to actually do
> anything when called after async compaction fails during the allocation
> slowpath.
> 
> The impact of the patches was validated using mmtests's stress-highalloc
> benchmark on an x86_64 machine with 4GB memory.
> 
> Due to instability of the results (mostly related to the bugs fixed by
> patches 2 and 3), 10 iterations were performed, taking min, mean and max
> values for success rates and mean values for time and vmstat-based metrics.
> 
> First, the default GFP_HIGHUSER_MOVABLE allocations were tested with the
> patches stacked on top of v3.13-rc2. Patch 2 is OK to serve as baseline due
> to there being no functional changes in patches 1 and 2.

Comments below.
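As a side note for readers following the series: patch 2 ("mm: compaction:
encapsulate defer reset logic") presumably pulls the open-coded reset of the
deferred-compaction state into a helper. A rough sketch of what such a helper
could look like, based only on the description above and on the existing zone
fields compact_considered, compact_defer_shift and compact_order_failed (the
actual patch may of course differ):

/* Sketch only, not the actual patch. */
static inline void compaction_defer_reset(struct zone *zone, int order,
					  bool alloc_success)
{
	/*
	 * A successful high-order allocation means compaction is working
	 * again, so stop deferring it for this zone.
	 */
	if (alloc_success) {
		zone->compact_considered = 0;
		zone->compact_defer_shift = 0;
	}
	/*
	 * Remember the highest order known to have succeeded, so that
	 * lower-order requests are not needlessly deferred.
	 */
	if (order >= zone->compact_order_failed)
		zone->compact_order_failed = order + 1;
}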
> 
> stress-highalloc
>                        3.13-rc2          3.13-rc2          3.13-rc2          3.13-rc2          3.13-rc2
>                         2-nothp           3-nothp           4-nothp           5-nothp           6-nothp
> Success 1 Min      9.00 (  0.00%)   10.00 ( -11.11%)  43.00 (-377.78%)  43.00 (-377.78%)  33.00 (-266.67%)
> Success 1 Mean    27.50 (  0.00%)   25.30 (   8.00%)  45.50 ( -65.45%)  45.90 ( -66.91%)  46.30 ( -68.36%)
> Success 1 Max     36.00 (  0.00%)   36.00 (   0.00%)  47.00 ( -30.56%)  48.00 ( -33.33%)  52.00 ( -44.44%)
> Success 2 Min     10.00 (  0.00%)    8.00 (  20.00%)  46.00 (-360.00%)  45.00 (-350.00%)  35.00 (-250.00%)
> Success 2 Mean    26.40 (  0.00%)   23.50 (  10.98%)  47.30 ( -79.17%)  47.60 ( -80.30%)  48.10 ( -82.20%)
> Success 2 Max     34.00 (  0.00%)   33.00 (   2.94%)  48.00 ( -41.18%)  50.00 ( -47.06%)  54.00 ( -58.82%)
> Success 3 Min     65.00 (  0.00%)   63.00 (   3.08%)  85.00 ( -30.77%)  84.00 ( -29.23%)  85.00 ( -30.77%)
> Success 3 Mean    76.70 (  0.00%)   70.50 (   8.08%)  86.20 ( -12.39%)  85.50 ( -11.47%)  86.00 ( -12.13%)
> Success 3 Max     87.00 (  0.00%)   86.00 (   1.15%)  88.00 (  -1.15%)  87.00 (   0.00%)  87.00 (   0.00%)
> 
>              3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2
>               2-nothp     3-nothp     4-nothp     5-nothp     6-nothp
> User          6437.72     6459.76     5960.32     5974.55     6019.67
> System        1049.65     1049.09     1029.32     1031.47     1032.31
> Elapsed       1856.77     1874.48     1949.97     1994.22     1983.15
> 
>                                3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2
>                                 2-nothp     3-nothp     4-nothp     5-nothp     6-nothp
> Minor Faults                  253952267   254581900   250030122   250507333   250157829
> Major Faults                        420         407         506         530         530
> Swap Ins                              4           9           9           6           6
> Swap Outs                           398         375         345         346         333
> Direct pages scanned             197538      189017      298574      287019      299063
> Kswapd pages scanned            1809843     1801308     1846674     1873184     1861089
> Kswapd pages reclaimed          1806972     1798684     1844219     1870509     1858622
> Direct pages reclaimed           197227      188829      298380      286822      298835
> Kswapd efficiency                   99%         99%         99%         99%         99%
> Kswapd velocity                 953.382     970.449     952.243     934.569     922.286
> Direct efficiency                   99%         99%         99%         99%         99%
> Direct velocity                 104.058     101.832     153.961     143.200     148.205
> Percentage direct scans              9%          9%         13%         13%         13%
> Zone normal velocity            347.289     359.676     348.063     339.933     332.983
> Zone dma32 velocity             710.151     712.605     758.140     737.835     737.507
> Zone dma velocity                 0.000       0.000       0.000       0.000       0.000
> Page writes by reclaim          557.600     429.000     353.600     426.400     381.800
> Page writes file                    159          53           7          79          48
> Page writes anon                    398         375         345         346         333
> Page reclaim immediate              825         644         411         575         420
> Sector Reads                    2781750     2769780     2878547     2939128     2910483
> Sector Writes                  12080843    12083351    12012892    12002132    12010745
> Page rescued immediate                0           0           0           0           0
> Slabs scanned                   1575654     1545344     1778406     1786700     1794073
> Direct inode steals                9657       10037       15795       14104       14645
> Kswapd inode steals               46857       46335       50543       50716       51796
> Kswapd skipped wait                   0           0           0           0           0
> THP fault alloc                      97          91          81          71          77
> THP collapse alloc                  456         506         546         544         565
> THP splits                            6           5           5           4           4
> THP fault fallback                    0           1           0           0           0
> THP collapse fail                    14          14          12          13          12
> Compaction stalls                  1006         980        1537        1536        1548
> Compaction success                  303         284         562         559         578
> Compaction failures                 702         696         974         976         969
> Page migrate success            1177325     1070077     3927538     3781870     3877057
> Page migrate failure                  0           0           0           0           0
> Compaction pages isolated       2547248     2306457     8301218     8008500     8200674
> Compaction migrate scanned     42290478    38832618   153961130   154143900   159141197
> Compaction free scanned        89199429    79189151   356529027   351943166   356326727
> Compaction cost                    1566        1426        5312        5156        5294
> NUMA PTE updates                      0           0           0           0           0
> NUMA hint faults                      0           0           0           0           0
> NUMA hint local faults                0           0           0           0           0
> NUMA hint local percent             100         100         100         100         100
> NUMA pages migrated                   0           0           0           0           0
> AutoNUMA cost                         0           0           0           0           0
> 
> 
> Observations:
> - The "Success 3" line is the allocation success rate with the system idle
>   (phases 1 and 2 are with background interference). I used to get stable
>   values around 85% with vanilla 3.11. The lower min and mean values came
>   with 3.12. This was bisected to commit 81c0a2bb ("mm: page_alloc: fair
>   zone allocator policy"). As explained in the comment for patch 3, I don't
>   think the commit is wrong, but it makes the effect of the compaction bugs
>   worse. From patch 3 onwards, the results are OK and match the 3.11
>   results.
> - Patch 4 also clearly helps phases 1 and 2, and exceeds any results I've
>   seen with 3.11 (I didn't measure it that thoroughly then, but it was never
>   above 40%).
> - Compaction cost and the number of scanned pages are higher, especially due
>   to patch 4. However, keep in mind that patches 3 and 4 fix existing bugs
>   in the current design of compaction overhead mitigation; they do not
>   change it. If the overhead is found unacceptable, it should be decreased
>   differently (and consistently, not depending on random conditions) than
>   the current implementation does. In contrast, patches 5 and 6 (which are
>   not strictly bug fixes) do not increase the overhead (but also do not
>   increase success rates). This might be a limitation of the
>   stress-highalloc benchmark, as it's quite uniform.
> 
> Another set of results is from configuring stress-highalloc to allocate
> with flags similar to what THP uses:
> (GFP_HIGHUSER_MOVABLE|__GFP_NOMEMALLOC|__GFP_NORETRY|__GFP_NO_KSWAPD)
> 
> stress-highalloc
>                        3.13-rc2          3.13-rc2          3.13-rc2          3.13-rc2          3.13-rc2
>                           2-thp             3-thp             4-thp             5-thp             6-thp
> Success 1 Min      2.00 (  0.00%)    7.00 (-250.00%)  18.00 (-800.00%)  19.00 (-850.00%)  26.00 (-1200.00%)
> Success 1 Mean    19.20 (  0.00%)   17.80 (   7.29%)  29.20 ( -52.08%)  29.90 ( -55.73%)  32.80 (  -70.83%)
> Success 1 Max     27.00 (  0.00%)   29.00 (  -7.41%)  35.00 ( -29.63%)  36.00 ( -33.33%)  37.00 (  -37.04%)
> Success 2 Min      3.00 (  0.00%)    8.00 (-166.67%)  21.00 (-600.00%)  21.00 (-600.00%)  32.00 ( -966.67%)
> Success 2 Mean    19.30 (  0.00%)   17.90 (   7.25%)  32.20 ( -66.84%)  32.60 ( -68.91%)  35.70 (  -84.97%)
> Success 2 Max     27.00 (  0.00%)   30.00 ( -11.11%)  36.00 ( -33.33%)  37.00 ( -37.04%)  39.00 (  -44.44%)
> Success 3 Min     62.00 (  0.00%)   62.00 (   0.00%)  85.00 ( -37.10%)  75.00 ( -20.97%)  64.00 (   -3.23%)
> Success 3 Mean    66.30 (  0.00%)   65.50 (   1.21%)  85.60 ( -29.11%)  83.40 ( -25.79%)  83.50 (  -25.94%)
> Success 3 Max     70.00 (  0.00%)   69.00 (   1.43%)  87.00 ( -24.29%)  86.00 ( -22.86%)  87.00 (  -24.29%)
> 
>              3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2
>                 2-thp       3-thp       4-thp       5-thp       6-thp
> User          6547.93     6475.85     6265.54     6289.46     6189.96
> System        1053.42     1047.28     1043.23     1042.73     1038.73
> Elapsed       1835.43     1821.96     1908.67     1912.74     1956.38

Hello, Vlastimil.

I have some questions about your stats, not your patchset, just out of
curiosity. :)

Are these "elapsed time" and "vmstat" results for the "Success 3" scenario?
If so, could you show me the others? I wonder why the thp case consumes more
system time than the no-thp case. And I found that the elapsed time shows no
big difference between the two cases, roughly less than 2%. In this
situation, do we get more benefit from aggressive allocation like the no-thp
case?

Thanks.
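As a point of reference for the "thp" vs "no-thp" comparison above, the
THP-like configuration corresponds to high-order requests roughly of the
following form. This is an illustrative sketch only, not taken from the
patchset or from mmtests; in particular the use of HPAGE_PMD_ORDER is my
assumption about the order stress-highalloc requests:

#include <linux/gfp.h>
#include <linux/huge_mm.h>

/*
 * Sketch: a "THP-like" allocation attempt using the flags quoted above.
 * HPAGE_PMD_ORDER (order-9, i.e. a 2MB huge page on x86_64) is an assumed
 * value; the benchmark may use a different order.
 */
static struct page *thp_like_alloc(void)
{
	return alloc_pages(GFP_HIGHUSER_MOVABLE | __GFP_NOMEMALLOC |
			   __GFP_NORETRY | __GFP_NO_KSWAPD,
			   HPAGE_PMD_ORDER);
}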
> 
>                                3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2
>                                   2-thp       3-thp       4-thp       5-thp       6-thp
> Minor Faults                  256805673   253106328   253222299   249830289   251184418
> Major Faults                        395         375         423         434         448
> Swap Ins                             12          10          10          12           9
> Swap Outs                           530         537         487         455         415
> Direct pages scanned              71859       86046      153244      152764      190713
> Kswapd pages scanned            1900994     1870240     1898012     1892864     1880520
> Kswapd pages reclaimed          1897814     1867428     1894939     1890125     1877924
> Direct pages reclaimed            71766       85908      153167      152643      190600
> Kswapd efficiency                   99%         99%         99%         99%         99%
> Kswapd velocity                1029.000    1067.782    1000.091     991.049     951.218
> Direct efficiency                   99%         99%         99%         99%         99%
> Direct velocity                  38.897      49.127      80.747      79.983      96.468
> Percentage direct scans              3%          4%          7%          7%          9%
> Zone normal velocity            351.377     372.494     348.910     341.689     335.310
> Zone dma32 velocity             716.520     744.414     731.928     729.343     712.377
> Zone dma velocity                 0.000       0.000       0.000       0.000       0.000
> Page writes by reclaim          669.300     604.000     545.700     538.900     429.900
> Page writes file                    138          66          58          83          14
> Page writes anon                    530         537         487         455         415
> Page reclaim immediate              806         655         772         548         517
> Sector Reads                    2711956     2703239     2811602     2818248     2839459
> Sector Writes                  12163238    12018662    12038248    11954736    11994892
> Page rescued immediate                0           0           0           0           0
> Slabs scanned                   1385088     1388364     1507968     1513292     1558656
> Direct inode steals                1739        2564        4622        5496        6007
> Kswapd inode steals               47461       46406       47804       48013       48466
> Kswapd skipped wait                   0           0           0           0           0
> THP fault alloc                     110          82          84          69          70
> THP collapse alloc                  445         482         467         462         539
> THP splits                            6           5           4           5           3
> THP fault fallback                    3           0           0           0           0
> THP collapse fail                    15          14          14          14          13
> Compaction stalls                   659         685        1033        1073        1111
> Compaction success                  222         225         410         427         456
> Compaction failures                 436         460         622         646         655
> Page migrate success             446594      439978     1085640     1095062     1131716
> Page migrate failure                  0           0           0           0           0
> Compaction pages isolated       1029475     1013490     2453074     2482698     2565400
> Compaction migrate scanned      9955461    11344259    24375202    27978356    30494204
> Compaction free scanned        27715272    28544654    80150615    82898631    85756132
> Compaction cost                     552         555        1344        1379        1436
> NUMA PTE updates                      0           0           0           0           0
> NUMA hint faults                      0           0           0           0           0
> NUMA hint local faults                0           0           0           0           0
> NUMA hint local percent             100         100         100         100         100
> NUMA pages migrated                   0           0           0           0           0
> AutoNUMA cost                         0           0           0           0           0
> 
> There are some differences from the previous results for THP-like allocations:
> - Here, the bad result for the unpatched kernel in phase 3 is much more
>   consistently between 65-70% and not related to the "regression" in 3.12.
>   Still, there is the improvement from patch 4 onwards, which brings it on
>   par with simple GFP_HIGHUSER_MOVABLE allocations.
> - Compaction costs have increased, but nowhere near as much as in the
>   non-THP case. Again, the patches should be worth the gained determinism.
> - Patches 5 and 6 somewhat increase the number of migrate-scanned pages.
>   This is most likely due to the __GFP_NO_KSWAPD flag, which means the
>   cached pfn's and pageblock skip bits are not reset by kswapd that often
>   (at least in phase 3, where no concurrent activity would wake up kswapd),
>   and the patches thus help the sync-after-async compaction. It doesn't,
>   however, show that sync compaction would help that much with success
>   rates, which can again be seen as a limitation of the benchmark scenario.
> 
> 
> 
> Mel Gorman (1):
>   mm: compaction: trace compaction begin and end
> 
> Vlastimil Babka (5):
>   mm: compaction: encapsulate defer reset logic
>   mm: compaction: reset cached scanner pfn's before reading them
>   mm: compaction: detect when scanners meet in isolate_freepages
>   mm: compaction: do not mark unmovable pageblocks as skipped in async
>     compaction
>   mm: compaction: reset scanner positions immediately when they meet
> 
>  include/linux/compaction.h        | 16 ++++++++++
>  include/trace/events/compaction.h | 42 +++++++++++++++++++++++++++
>  mm/compaction.c                   | 61 +++++++++++++++++++++++++++------------
>  mm/page_alloc.c                   |  5 +---
>  4 files changed, 102 insertions(+), 22 deletions(-)
> 
> -- 
> 1.8.4