On Wed, Apr 11, 2012 at 9:38 AM, Mel Gorman <mgorman@xxxxxxx> wrote:
> Andrew, these three patches should replace the two lumpy reclaim patches
> you already have. When applied, there is no functional difference (slight
> changes in layout) but the changelogs are better.
>
> Changelog since V1
> o Ying pointed out that compaction was waiting on page writeback and the
>   description of the patches in V1 was broken. This version is the same
>   except that it is structured differently to explain that waiting on
>   page writeback is removed.
> o Rebased to v3.4-rc2
>
> This series removes lumpy reclaim and some stalling logic that was
> unintentionally being used by memory compaction. The end result
> is that stalling on dirty pages during page reclaim now depends on
> wait_iff_congested().
>
> Four kernels were compared
>
> 3.3.0       vanilla
> 3.4.0-rc2   vanilla
> 3.4.0-rc2   lumpyremove-v2 is patch one from this series
> 3.4.0-rc2   nosync-v2r3 is the full series
>
> Removing lumpy reclaim saves almost 900 bytes of text whereas the full
> series removes about 1200 bytes.
>
>    text    data     bss      dec    hex filename
> 6740375 1927944 2260992 10929311 a6c49f vmlinux-3.4.0-rc2-vanilla
> 6739479 1927944 2260992 10928415 a6c11f vmlinux-3.4.0-rc2-lumpyremove-v2
> 6739159 1927944 2260992 10928095 a6bfdf vmlinux-3.4.0-rc2-nosync-v2
>
> There are behaviour changes in the series and so tests were run with
> monitoring of ftrace events. This disrupts results so the performance
> results are distorted but the new behaviour should be clearer.
>
> fs-mark running in a threaded configuration showed little of interest as
> it did not push reclaim aggressively.
>
> FS-Mark Multi Threaded
>                       3.3.0-vanilla          rc2-vanilla      lumpyremove-v2r3          nosync-v2r3
> Files/s  min          3.20 ( 0.00%)          3.20 ( 0.00%)        3.20 ( 0.00%)         3.20 ( 0.00%)
> Files/s  mean         3.20 ( 0.00%)          3.20 ( 0.00%)        3.20 ( 0.00%)         3.20 ( 0.00%)
> Files/s  stddev       0.00 ( 0.00%)          0.00 ( 0.00%)        0.00 ( 0.00%)         0.00 ( 0.00%)
> Files/s  max          3.20 ( 0.00%)          3.20 ( 0.00%)        3.20 ( 0.00%)         3.20 ( 0.00%)
> Overhead min     508667.00 ( 0.00%)     521350.00 (-2.49%)   544292.00 (-7.00%)    547168.00 (-7.57%)
> Overhead mean    551185.00 ( 0.00%)     652690.73 (-18.42%)  991208.40 (-79.83%)   570130.53 (-3.44%)
> Overhead stddev   18200.69 ( 0.00%)     331958.29 (-1723.88%) 1579579.43 (-8578.68%)  9576.81 (47.38%)
> Overhead max     576775.00 ( 0.00%)    1846634.00 (-220.17%) 6901055.00 (-1096.49%) 585675.00 (-1.54%)
>
> MMTests Statistics: duration
> Sys  Time Running Test (seconds)        309.90    300.95    307.33    298.95
> User+Sys Time Running Test (seconds)    319.32    309.67    315.69    307.51
> Total Elapsed Time (seconds)           1187.85   1193.09   1191.98   1193.73
>
> MMTests Statistics: vmstat
> Page Ins                                 80532     82212     81420     79480
> Page Outs                            111434984 111456240 111437376 111582628
> Swap Ins                                     0         0         0         0
> Swap Outs                                    0         0         0         0
> Direct pages scanned                     44881     27889     27453     34843
> Kswapd pages scanned                  25841428  25860774  25861233  25843212
> Kswapd pages reclaimed                25841393  25860741  25861199  25843179
> Direct pages reclaimed                   44881     27889     27453     34843
> Kswapd efficiency                          99%       99%       99%       99%
> Kswapd velocity                      21754.791 21675.460 21696.029 21649.127
> Direct efficiency                         100%      100%      100%      100%
> Direct velocity                         37.783    23.375    23.031    29.188
> Percentage direct scans                     0%        0%        0%        0%
>
> ftrace showed that there was no stalling on writeback or pages submitted
> for IO from reclaim context.
>
> postmark was similar and, while it was more interesting, it also did not
> push reclaim heavily.
>
> POSTMARK
>                                     3.3.0-vanilla     rc2-vanilla  lumpyremove-v2r3     nosync-v2r3
> Transactions per second:            16.00 ( 0.00%)   20.00 (25.00%)   18.00 (12.50%)   17.00 ( 6.25%)
> Data megabytes read per second:     18.80 ( 0.00%)   24.27 (29.10%)   22.26 (18.40%)   20.54 ( 9.26%)
> Data megabytes written per second:  35.83 ( 0.00%)   46.25 (29.08%)   42.42 (18.39%)   39.14 ( 9.24%)
> Files created alone per second:     28.00 ( 0.00%)   38.00 (35.71%)   34.00 (21.43%)   30.00 ( 7.14%)
> Files create/transact per second:    8.00 ( 0.00%)   10.00 (25.00%)    9.00 (12.50%)    8.00 ( 0.00%)
> Files deleted alone per second:    556.00 ( 0.00%) 1224.00 (120.14%) 3062.00 (450.72%) 6124.00 (1001.44%)
> Files delete/transact per second:    8.00 ( 0.00%)   10.00 (25.00%)    9.00 (12.50%)    8.00 ( 0.00%)
>
> MMTests Statistics: duration
> Sys  Time Running Test (seconds)        113.34    107.99    109.73    108.72
> User+Sys Time Running Test (seconds)    145.51    139.81    143.32    143.55
> Total Elapsed Time (seconds)           1159.16    899.23    980.17   1062.27
>
> MMTests Statistics: vmstat
> Page Ins                              13710192  13729032  13727944  13760136
> Page Outs                             43071140  42987228  42733684  42931624
> Swap Ins                                     0         0         0         0
> Swap Outs                                    0         0         0         0
> Direct pages scanned                         0         0         0         0
> Kswapd pages scanned                   9941613   9937443   9939085   9929154
> Kswapd pages reclaimed                 9940926   9936751   9938397   9928465
> Direct pages reclaimed                       0         0         0         0
> Kswapd efficiency                          99%       99%       99%       99%
> Kswapd velocity                       8576.567 11051.058 10140.164  9347.109
> Direct efficiency                         100%      100%      100%      100%
> Direct velocity                          0.000     0.000     0.000     0.000
>
> It looks here like the full series regresses performance, but as ftrace
> showed no usage of wait_iff_congested() or sync reclaim I am assuming it
> is a disruption due to monitoring. Other data such as memory usage, page
> IO and swap IO all looked similar.
>
> Running a benchmark with a plain DD showed nothing very interesting. The
> full series stalled in wait_iff_congested() slightly less but stall times
> on vanilla kernels were marginal.
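As an aside before the file-backed results that follow: the ftrace figures quoted throughout come from the vmscan and writeback trace event groups (mm_vmscan_writepage and related tracepoints). Below is a minimal userspace sketch of enabling those event groups and streaming the trace while a benchmark runs. It assumes debugfs is mounted at /sys/kernel/debug and is only an illustration of the tracing interface, not the monitoring actually used to produce these numbers.

/*
 * Minimal sketch: enable every trace event under events/vmscan and
 * events/writeback, then stream the trace while a benchmark runs.
 * Assumes debugfs is mounted at /sys/kernel/debug and that the kernel
 * was built with tracepoints enabled; run as root.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define TRACE_DIR "/sys/kernel/debug/tracing"

static void write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) < 0) {
		perror(path);
		exit(EXIT_FAILURE);
	}
	close(fd);
}

int main(void)
{
	char buf[4096];
	ssize_t len;
	int fd;

	write_str(TRACE_DIR "/events/vmscan/enable", "1");
	write_str(TRACE_DIR "/events/writeback/enable", "1");
	write_str(TRACE_DIR "/tracing_on", "1");

	/* Stream events to stdout; redirect to a log file in practice */
	fd = open(TRACE_DIR "/trace_pipe", O_RDONLY);
	if (fd < 0) {
		perror("trace_pipe");
		return EXIT_FAILURE;
	}
	while ((len = read(fd, buf, sizeof(buf))) > 0)
		fwrite(buf, 1, len, stdout);
	close(fd);
	return 0;
}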
>
> Running a benchmark that hammered on file-backed mappings showed stalls
> due to congestion but not in sync writebacks.
>
> MICRO
>                                  3.3.0-vanilla rc2-vanilla lumpyremove-v2r3 nosync-v2r3
> MMTests Statistics: duration
> Sys  Time Running Test (seconds)        308.13    294.50    298.75    299.53
> User+Sys Time Running Test (seconds)    330.45    316.28    318.93    320.79
> Total Elapsed Time (seconds)           1814.90   1833.88   1821.14   1832.91
>
> MMTests Statistics: vmstat
> Page Ins                                108712    120708     97224    110344
> Page Outs                            155514576 156017404 155813676 156193256
> Swap Ins                                     0         0         0         0
> Swap Outs                                    0         0         0         0
> Direct pages scanned                   2599253   1550480   2512822   2414760
> Kswapd pages scanned                  69742364  71150694  68839041  69692533
> Kswapd pages reclaimed                34824488  34773341  34796602  34799396
> Direct pages reclaimed                   53693     94750     61792     75205
> Kswapd efficiency                          49%       48%       50%       49%
> Kswapd velocity                      38427.662 38797.901 37799.972 38022.889
> Direct efficiency                           2%        6%        2%        3%
> Direct velocity                       1432.174   845.464  1379.807  1317.446
> Percentage direct scans                     3%        2%        3%        3%
> Page writes by reclaim                       0         0         0         0
> Page writes file                             0         0         0         0
> Page writes anon                             0         0         0         0
> Page reclaim immediate                       0         0         0      1218
> Page rescued immediate                       0         0         0         0
> Slabs scanned                            15360     16384     13312     16384
> Direct inode steals                          0         0         0         0
> Kswapd inode steals                       4340      4327      1630      4323
>
> FTrace Reclaim Statistics: congestion_wait
> Direct number congest     waited             0         0         0         0
> Direct time   congest     waited           0ms       0ms       0ms       0ms
> Direct full   congest     waited             0         0         0         0
> Direct number conditional waited           900       870       754       789
> Direct time   conditional waited           0ms       0ms       0ms      20ms
> Direct full   conditional waited             0         0         0         0
> KSwapd number congest     waited          2106      2308      2116      1915
> KSwapd time   congest     waited      139924ms  157832ms  125652ms  132516ms
> KSwapd full   congest     waited          1346      1530      1202      1278
> KSwapd number conditional waited         12922     16320     10943     14670
> KSwapd time   conditional waited           0ms       0ms       0ms       0ms
> KSwapd full   conditional waited             0         0         0         0
>
> Reclaim statistics are not radically changed. The stall times in kswapd
> are massive but it is clear that it is due to calls to congestion_wait()
> and that is almost certainly the call in balance_pgdat(). Otherwise stalls
> due to dirty pages are non-existent.
>
> I ran a benchmark that stressed high-order allocation. This is a very
> artificial load but was used in the past to evaluate lumpy reclaim and
> compaction. Generally I look at allocation success rates and latency figures.
>
> STRESS-HIGHALLOC
>                 3.3.0-vanilla     rc2-vanilla  lumpyremove-v2r3    nosync-v2r3
> Pass 1         81.00 ( 0.00%)  28.00 (-53.00%)  24.00 (-57.00%)  28.00 (-53.00%)
> Pass 2         82.00 ( 0.00%)  39.00 (-43.00%)  38.00 (-44.00%)  43.00 (-39.00%)
> while Rested   88.00 ( 0.00%)  87.00 ( -1.00%)  88.00 (  0.00%)  88.00 (  0.00%)
>
> MMTests Statistics: duration
> Sys  Time Running Test (seconds)        740.93    681.42    685.14    684.87
> User+Sys Time Running Test (seconds)   2922.65   3269.52   3281.35   3279.44
> Total Elapsed Time (seconds)           1161.73   1152.49   1159.55   1161.44
>
> MMTests Statistics: vmstat
> Page Ins                               4486020   2807256   2855944   2876244
> Page Outs                              7261600   7973688   7975320   7986120
> Swap Ins                                 31694         0         0         0
> Swap Outs                                98179         0         0         0
> Direct pages scanned                     53494     57731     34406    113015
> Kswapd pages scanned                   6271173   1287481   1278174   1219095
> Kswapd pages reclaimed                 2029240   1281025   1260708   1201583
> Direct pages reclaimed                    1468     14564     16649     92456
> Kswapd efficiency                          32%       99%       98%       98%
> Kswapd velocity                       5398.133  1117.130  1102.302  1049.641
> Direct efficiency                           2%       25%       48%       81%
> Direct velocity                         46.047    50.092    29.672    97.306
> Percentage direct scans                     0%        4%        2%        8%
> Page writes by reclaim                 1616049         0         0         0
> Page writes file                       1517870         0         0         0
> Page writes anon                         98179         0         0         0
> Page reclaim immediate                  103778     27339      9796     17831
> Page rescued immediate                       0         0         0         0
> Slabs scanned                          1096704    986112    980992    998400
> Direct inode steals                        223    215040    216736    247881
> Kswapd inode steals                     175331     61548     68444     63066
> Kswapd skipped wait                      21991         0         1         0
> THP fault alloc                              1       135       125       134
> THP collapse alloc                         393       311       228       236
> THP splits                                  25        13         7         8
> THP fault fallback                           0         0         0         0
> THP collapse fail                            3         5         7         7
> Compaction stalls                          865      1270      1422      1518
> Compaction success                         370       401       353       383
> Compaction failures                        495       869      1069      1135
> Compaction pages moved                  870155   3828868   4036106   4423626
> Compaction move failure                  26429     23865     29742     27514
>
> Success rates are completely hosed for 3.4-rc2 which is almost certainly
> due to [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. I
> expected this would happen for kswapd and impair allocation success rates
> (https://lkml.org/lkml/2012/1/25/166) but I did not anticipate this much
> of a difference: 80% less scanning, 37% less reclaim by kswapd.
>
> In comparison, reclaim/compaction is not aggressive and gives up easily,
> which is the intended behaviour. hugetlbfs uses __GFP_REPEAT and would be
> much more aggressive about reclaim/compaction than THP allocations are. The
> stress test above is allocating like neither THP nor hugetlbfs but is much
> closer to THP.
>
> Mainline is now impaired in terms of high order allocation under heavy load
> although I do not know to what degree as I did not test with __GFP_REPEAT.
> Keep this in mind for bugs related to hugepage pool resizing, THP allocation
> and high order atomic allocation failures from network devices.
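On the hugepage pool resizing point above: growing the pool through /proc/sys/vm/nr_hugepages is the usual way those __GFP_REPEAT-backed high-order allocations get exercised, and comparing the requested pool size against what the kernel reports back gives a crude success-rate probe. The sketch below is an illustration only, not the stress test used for the figures above; the request of 128 pages is arbitrary.

/*
 * Rough probe of high-order allocation success, not part of the
 * stress-highalloc test itself: request huge pages by writing to
 * /proc/sys/vm/nr_hugepages (allocations that use __GFP_REPEAT on
 * kernels of this era) and report how many ended up in the pool.
 * Run as root.
 */
#include <stdio.h>
#include <stdlib.h>

#define NR_HUGEPAGES "/proc/sys/vm/nr_hugepages"

int main(int argc, char *argv[])
{
	int requested = argc > 1 ? atoi(argv[1]) : 128;
	int allocated = -1;
	FILE *fp;

	/* Ask for the new pool size */
	fp = fopen(NR_HUGEPAGES, "w");
	if (!fp) {
		perror(NR_HUGEPAGES);
		return EXIT_FAILURE;
	}
	fprintf(fp, "%d\n", requested);
	fclose(fp);

	/* Read back how many pages the kernel actually managed to allocate */
	fp = fopen(NR_HUGEPAGES, "r");
	if (!fp) {
		perror(NR_HUGEPAGES);
		return EXIT_FAILURE;
	}
	if (fscanf(fp, "%d", &allocated) != 1)
		allocated = -1;
	fclose(fp);

	printf("requested %d huge pages, pool now has %d\n",
	       requested, allocated);
	return allocated == requested ? EXIT_SUCCESS : EXIT_FAILURE;
}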
>
> In terms of congestion throttling, I see the following for this test
>
> FTrace Reclaim Statistics: congestion_wait
> Direct number congest     waited             3         0         0         0
> Direct time   congest     waited           0ms       0ms       0ms       0ms
> Direct full   congest     waited             0         0         0         0
> Direct number conditional waited           957       512      1081      1075
> Direct time   conditional waited           0ms       0ms       0ms       0ms
> Direct full   conditional waited             0         0         0         0
> KSwapd number congest     waited            36         4         3         5
> KSwapd time   congest     waited        3148ms     400ms     300ms     500ms
> KSwapd full   congest     waited            30         4         3         5
> KSwapd number conditional waited         88514       197       332       542
> KSwapd time   conditional waited        4980ms       0ms       0ms       0ms
> KSwapd full   conditional waited            49         0         0         0
>
> The "conditional waited" times are the most interesting as these are directly
> impacted by the number of dirty pages encountered during scan. As lumpy
> reclaim is no longer scanning contiguous ranges, it is finding fewer dirty
> pages. This brings wait times down from about 5 seconds to 0. kswapd itself is
> still calling congestion_wait() so it will still stall, but a lot less.
>
> In terms of the type of IO we were doing, I see this
>
> FTrace Reclaim Statistics: mm_vmscan_writepage
> Direct writes anon  sync                     0         0         0         0
> Direct writes anon  async                    0         0         0         0
> Direct writes file  sync                     0         0         0         0
> Direct writes file  async                    0         0         0         0
> Direct writes mixed sync                     0         0         0         0
> Direct writes mixed async                    0         0         0         0
> KSwapd writes anon  sync                     0         0         0         0
> KSwapd writes anon  async                91682         0         0         0
> KSwapd writes file  sync                     0         0         0         0
> KSwapd writes file  async               822629         0         0         0
> KSwapd writes mixed sync                     0         0         0         0
> KSwapd writes mixed async                    0         0         0         0
>
> In 3.2, kswapd was doing a bunch of async writes of pages but
> reclaim/compaction was never reaching a point where it was doing sync
> IO. This does not guarantee that reclaim/compaction was not calling
> wait_on_page_writeback() but I would consider it unlikely. It indicates
> that merging patches 2 and 3 to stop reclaim/compaction calling
> wait_on_page_writeback() should be safe.
>
>  include/trace/events/vmscan.h |   40 ++-----
>  mm/vmscan.c                   |  263 ++++------------------------------------
>  2 files changed, 37 insertions(+), 266 deletions(-)
>
> --
> 1.7.9.2
>

It might be a naive question, but what do we do with users who have the
following in their .config file?

# CONFIG_COMPACTION is not set

--Ying