This is a follow-on series from "Avoid overflowing of stack during page
reclaim". It eliminates writeback requiring a filesystem from direct reclaim
and follows on by reducing the amount of IO required from page reclaim to
mitigate any corner cases from the modification.

Most of this series updates what is already in mmotm.

Changelog since V5
  o Remove the writeback-related patches. They are still undergoing
    changes and while they complement this series, the two series do
    not depend on each other.

Changelog since V4
  o Add patch to prioritise inodes for writeback
  o Drop modifications to XFS and btrfs
  o Correct units in post-processing script
  o Add new patches from Wu related to writeback
  o Only kick flusher threads when dirty file pages are encountered
  o Increase size of writeback window when reclaim encounters dirty pages
  o Remove looping logic from shrink_page_list and instead do it all from
    shrink_inactive_list
  o Rebase to 2.6.35-rc6

Changelog since V3
  o Distinguish between file and anon related IO from page reclaim
  o Allow anon writeback from reclaim context
  o Sync old inodes first in background writeback
  o Pre-emptively clean pages when dirty pages are encountered on the LRU
  o Rebase to 2.6.35-rc5

Changelog since V2
  o Add acks and reviewed-bys
  o Do not lock multiple pages at the same time for writeback as it's unsafe
  o Drop the clean_page_list function. It alters timing with very little
    benefit. Without the contiguous writing, it doesn't do much to simplify
    the subsequent patches either
  o Throttle processes that encounter dirty pages in direct reclaim. Instead
    wake up flusher threads to clean the number of pages encountered that
    were dirty

Changelog since V1
  o Merge with series that reduces stack usage in page reclaim in general
  o Allow memcg to writeback pages as they are not expected to overflow stack
  o Drop the contiguous-write patch for the moment

There is a problem in the stack depth usage of page reclaim.
Particularly during direct reclaim, it is possible to overflow the stack if
it calls into the filesystem's writepage function. This patch series begins
by preventing writeback from direct reclaim. As this is a potentially large
change, the last patch aims to reduce any filesystem writeback from page
reclaim and depend more on background flush.

The first patch in the series is a roll-up of what is currently in mmotm.
It's provided for convenience of testing.

Patches 2 and 3 note that it is important to distinguish between file and
anon page writeback from page reclaim as they use stack to different depths.
They update the trace points and scripts appropriately, noting which mmotm
patch they should be merged with.

Patch 4 notes that the units in the report are incorrect and fixes them.

Patch 5 prevents direct reclaim writing out filesystem pages while still
allowing writeback of anon pages, which are in less danger of stack overflow
and don't have something like background flush to clean them. For filesystem
pages, flusher threads are asked to clean the number of pages encountered,
the caller waits on congestion and puts the pages back on the LRU. For lumpy
reclaim, the caller will wait for a time, calling the flusher multiple times
and waiting on dirty pages to be written out, before trying to reclaim the
dirty pages a second time. This increases the responsibility of kswapd
somewhat because it's now cleaning pages on behalf of direct reclaimers but
unlike background flushers, kswapd knows what zone pages need to be cleaned
from. As it is async IO, it should not cause kswapd to stall (at least until
the queue is congested) but the order that pages are reclaimed on the LRU is
altered. Dirty pages that would have been reclaimed by direct reclaimers are
getting another lap on the LRU. The dirty pages could have been put on a
dedicated list but this increased counter overhead and the number of lists
and it is unclear if it is necessary.

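To make the Patch 5 policy concrete, here is a toy model in Python. It is
only a sketch of the decision described above, not the kernel code: the Page
class, scan_pages() and the returned counters are hypothetical names, and
treating written pages as simply staying on the LRU is a simplification. It
shows direct reclaim skipping dirty file pages and asking the flusher to
clean that many, while anon pages (and file pages seen by kswapd) may still
be written from reclaim context.

```python
from dataclasses import dataclass


@dataclass
class Page:
    is_file: bool  # True: filesystem-backed page, False: anonymous page
    dirty: bool


def scan_pages(pages, direct_reclaim):
    """Toy policy model of one reclaim pass (hypothetical, not kernel code).

    Returns (reclaimed, back_on_lru, nr_written, flusher_request).
    """
    reclaimed, back_on_lru = [], []
    nr_written = 0     # pages this context issued writeback for
    nr_dirty_file = 0  # dirty file pages skipped by direct reclaim
    for page in pages:
        if not page.dirty:
            # Clean pages can be reclaimed immediately.
            reclaimed.append(page)
        elif page.is_file and direct_reclaim:
            # Direct reclaim must not call into the filesystem:
            # count the page and give it another lap on the LRU.
            nr_dirty_file += 1
            back_on_lru.append(page)
        else:
            # Anon pages, and file pages seen by kswapd, may be written.
            nr_written += 1
            back_on_lru.append(page)
    # The flusher is asked to clean as many pages as were skipped.
    flusher_request = nr_dirty_file
    return reclaimed, back_on_lru, nr_written, flusher_request


lru = [Page(True, False), Page(True, True), Page(False, True), Page(False, False)]
direct = scan_pages(lru, direct_reclaim=True)
kswapd = scan_pages(lru, direct_reclaim=False)
print("direct: written", direct[2], "flusher asked for", direct[3])
print("kswapd: written", kswapd[2], "flusher asked for", kswapd[3])
```

In this model the only difference between the two contexts is who ends up
responsible for the dirty file page, which is the shift in responsibility
towards kswapd and the flusher threads that the description above refers to.
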
Patch 6 notes that dirty pages can still be found at the end of the LRU. If
a number of them are encountered, it's reasonable to assume that a similar
number of dirty pages will be discovered in the very near future as that was
the dirtying pattern at the time. The patch pre-emptively kicks the
background flusher to clean a number of pages, creating feedback from page
reclaim to the background flusher that is based on scanning rates.

I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were

    X86:     Intel P4 2-core
    X86-64:  AMD Phenom 4-core
    PPC64:   PPC970MP

Each used a single disk and the onboard IO controller. Dirty ratio was left
at 20. Tests on an earlier series indicated that moving to 40 did not make
much difference. The filesystem used for all tests was XFS.

The following kernels are compared.

traceonly-v6     is the first 4 patches of this series
nodirect-v6      is the first 5 patches
flushforward-v6  pre-emptively cleans pages when encountered on the LRU (patch 1-8)
flushprio-v5     flags inodes with dirty pages at end of LRU (patch 1-9)

The results of each test are broken up into two parts. The first part is a
report based on the ftrace postprocessing script and reports on direct
reclaim and kswapd activity. The second part reports what percentage of time
was spent in direct reclaim, what percentage of time kswapd was awake and
the percentage of pages scanned that were dirty.

To work out the percentage of time spent in direct reclaim, I used
/usr/bin/time to get the User + Sys CPU time. The stalled time was taken
from the post-processing script. The total time is (User + Sys + Stall) and
obviously the percentage is of stalled over total time.

I am omitting the actual performance results simply because they are not
interesting, with very few significant changes.

kernbench
=========

No writeback from reclaim initiated and no performance change of
significance.

IOzone
======

No writeback from reclaim initiated and no performance change of
significance.

SysBench
========

The results were based on a read/write test and, as the machine is
under-provisioned for this type of test, the figures are very unstable, with
variances of up to 15%, so they are not reported. Part of the problem is
that larger thread counts push the test into swap as the memory is
insufficient, which destabilises results further. I could tune for this, but
it was reclaim that was important.

X86                                         traceonly-v6     nodirect-v6 flushforward-v6
Direct reclaims                                       17              42               5
Direct reclaim pages scanned                        3766            4809             361
Direct reclaim write file async I/O                 1658               0               0
Direct reclaim write anon async I/O                    0             315               3
Direct reclaim write file sync I/O                     0               0               0
Direct reclaim write anon sync I/O                     0               0               0
Wake kswapd requests                              229080          262515          240991
Kswapd wakeups                                       578             646             567
Kswapd pages scanned                            12822445        13646919        11443966
Kswapd reclaim write file async I/O               488806          417628            1676
Kswapd reclaim write anon async I/O               132832          143463          110880
Kswapd reclaim write file sync I/O                     0               0               0
Kswapd reclaim write anon sync I/O                     0               0               0
Time stalled direct reclaim (seconds)               0.10            1.48            0.00
Time kswapd awake (seconds)                      1035.89         1051.81          846.99

Total pages scanned                             12826211        13651728        11444327
Percentage pages scanned/written                   4.86%           4.11%           0.98%
User/Sys Time Running Test (seconds)             1268.94         1313.47         1251.05
Percentage Time Spent Direct Reclaim               0.01%           0.11%           0.00%
Total Elapsed Time (seconds)                     7669.42         8198.84         7583.72
Percentage Time kswapd Awake                      13.51%          12.83%          11.17%

Dirty file pages in direct reclaim on the X86 test machine were not much of
a problem to begin with. The patches eliminate them as expected and the time
to complete the test was not negatively impacted as a result.
Pre-emptively writing back a window of dirty pages when they are encountered
on the LRU makes a big difference - the number of dirty file pages
encountered by kswapd was reduced by 99% and the percentage of dirty pages
encountered is reduced to less than 1%, most of which were anon.

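As a cross-check, the derived rows can be recomputed from the raw counters.
The sketch below uses the traceonly-v6 column of the X86 SysBench table. The
direct reclaim percentage follows the Stall / (User + Sys + Stall) definition
given earlier. Treating "Percentage pages scanned/written" as all reclaim
writes over total pages scanned is an assumption inferred from the figures,
not something taken from the post-processing script, and the variable names
are purely illustrative.

```python
def pct(numerator, denominator):
    """Return numerator/denominator as a percentage rounded to 2 places."""
    return round(100.0 * numerator / denominator, 2)


# Raw figures copied from the traceonly-v6 column of the X86 SysBench table.
direct_scanned = 3766
kswapd_scanned = 12822445
reclaim_writes = [
    1658, 0, 0, 0,       # direct reclaim: file async, anon async, file sync, anon sync
    488806, 132832, 0, 0,  # kswapd: file async, anon async, file sync, anon sync
]
stall_seconds = 0.10
kswapd_awake_seconds = 1035.89
user_sys_seconds = 1268.94
elapsed_seconds = 7669.42

total_scanned = direct_scanned + kswapd_scanned

# Assumed definition: every page written from reclaim over every page scanned.
pct_written = pct(sum(reclaim_writes), total_scanned)

# Stated definition: stalled time over (User + Sys + Stall).
pct_direct = pct(stall_seconds, user_sys_seconds + stall_seconds)

# kswapd awake time as a share of wall-clock time for the test.
pct_kswapd = pct(kswapd_awake_seconds, elapsed_seconds)

print(total_scanned, pct_written, pct_direct, pct_kswapd)
```

Under these assumptions the results agree with the table rows (12826211,
4.86%, 0.01% and 13.51%), which suggests the derived rows are consistent with
the raw counters.
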
X86-64                                      traceonly-v6     nodirect-v6 flushforward-v6
Direct reclaims                                      906             700             897
Direct reclaim pages scanned                      161635          221601           62442
Direct reclaim write file async I/O                16881               0               0
Direct reclaim write anon async I/O                 2558             562             706
Direct reclaim write file sync I/O                    24               0               0
Direct reclaim write anon sync I/O                     0               0               0
Wake kswapd requests                              844622          688841          803158
Kswapd wakeups                                      1480            1466            1529
Kswapd pages scanned                            16194333        16558633        15386430
Kswapd reclaim write file async I/O               460459          843545          193560
Kswapd reclaim write anon async I/O               243146          269235          210824
Kswapd reclaim write file sync I/O                     0               0               0
Kswapd reclaim write anon sync I/O                     0               0               0
Time stalled direct reclaim (seconds)              19.75           29.33            5.71
Time kswapd awake (seconds)                      2067.45         2058.20         2108.51

Total pages scanned                             16355968        16780234        15448872
Percentage pages scanned/written                   4.42%           6.63%           2.62%
User/Sys Time Running Test (seconds)              634.69          637.54          659.72
Percentage Time Spent Direct Reclaim               3.02%           4.40%           0.86%
Total Elapsed Time (seconds)                     6197.20         6234.80         6591.33
Percentage Time kswapd Awake                      33.36%          33.01%          31.99%

Direct reclaim of filesystem pages is eliminated as expected without an
impact on time although kswapd had to write back more pages as a result.
Again the full series reduces the percentage of dirty pages encountered
while scanning and overall, there is less reclaim activity.

PPC64                                       traceonly-v6     nodirect-v6 flushforward-v6
Direct reclaims                                     3378            4151            5658
Direct reclaim pages scanned                      380441          267139          495713
Direct reclaim write file async I/O                35532               0               0
Direct reclaim write anon async I/O                18863           17160           30672
Direct reclaim write file sync I/O                     9               0               0
Direct reclaim write anon sync I/O                     0               0               2
Wake kswapd requests                             1666305         1355794         1949445
Kswapd wakeups                                       533             509             551
Kswapd pages scanned                            16206261        15447359        15524846
Kswapd reclaim write file async I/O              1690129         1749868         1152304
Kswapd reclaim write anon async I/O               121416          151389          147141
Kswapd reclaim write file sync I/O                     0               0               0
Kswapd reclaim write anon sync I/O                     0               0               0
Time stalled direct reclaim (seconds)              90.84           69.37           74.36
Time kswapd awake (seconds)                      1932.31         1802.39         1999.15

Total pages scanned                             16586702        15714498        16020559
Percentage pages scanned/written                  11.25%          12.21%           8.30%
User/Sys Time Running Test (seconds)             1315.49         1249.23         1314.83
Percentage Time Spent Direct Reclaim               6.46%           5.26%           5.35%
Total Elapsed Time (seconds)                     8581.41         7988.79         8719.56
Percentage Time kswapd Awake                      22.52%          22.56%          22.93%

Direct reclaim filesystem writes are eliminated of course and the percentage
of dirty pages encountered is reduced.

Stress HighAlloc
================

This test builds a large number of kernels simultaneously so that the total
workload is 1.5 times the size of RAM. It then attempts to allocate all of
RAM as huge pages. The metric is the percentage of memory allocated under
load (Pass 1), on a second attempt under load (Pass 2) and when the kernel
compiles have finished and the system is quiet (At Rest). The patches have
little impact on the success rates.

X86                                         traceonly-v6     nodirect-v6 flushforward-v6
Direct reclaims                                      555             496             677
Direct reclaim pages scanned                      187498           83022           91321
Direct reclaim write file async I/O                  684               0               0
Direct reclaim write anon async I/O                33869            5834            7723
Direct reclaim write file sync I/O                   385               0               0
Direct reclaim write anon sync I/O                 23225             428             191
Wake kswapd requests                                1613            1484            1805
Kswapd wakeups                                       517             342             664
Kswapd pages scanned                            27791653         2570033         3023077
Kswapd reclaim write file async I/O               308778           19758             345
Kswapd reclaim write anon async I/O              5232938          109227          167984
Kswapd reclaim write file sync I/O                     0               0               0
Kswapd reclaim write anon sync I/O                     0               0               0
Time stalled direct reclaim (seconds)           18223.83          282.49          392.66
Time kswapd awake (seconds)                     15911.61          307.05          452.35

Total pages scanned                             27979151         2653055         3114398
Percentage pages scanned/written                  20.01%           5.10%           5.66%
User/Sys Time Running Test (seconds)             2806.35         1765.22         1873.86
Percentage Time Spent Direct Reclaim              86.66%          13.80%          17.32%
Total Elapsed Time (seconds)                    20382.81         2383.34         2491.23
Percentage Time kswapd Awake                      78.06%          12.88%          18.16%

Total time running the test was massively reduced by the series and
writebacks from page reclaim are reduced to almost negligible levels. The
percentage of dirty pages written is much reduced but obviously remains high
as there isn't an equivalent of background flushers for anon pages.

X86-64                                      traceonly-v6     nodirect-v6 flushforward-v6
Direct reclaims                                     1159            1112            1066
Direct reclaim pages scanned                      172491          147763          142100
Direct reclaim write file async I/O                 2496               0               0
Direct reclaim write anon async I/O                32486           19527           15355
Direct reclaim write file sync I/O                  1913               0               0
Direct reclaim write anon sync I/O                 14434            2806            3704
Wake kswapd requests                                1159            1101            1061
Kswapd wakeups                                      1110             827             785
Kswapd pages scanned                            23467327         8064964         4873397
Kswapd reclaim write file async I/O               652531           86003            9135
Kswapd reclaim write anon async I/O              2476541          500556          205612
Kswapd reclaim write file sync I/O                     0               0               0
Kswapd reclaim write anon sync I/O                     0               0               0
Time stalled direct reclaim (seconds)            7906.48         1355.70          428.86
Time kswapd awake (seconds)                      4263.89         1029.43          468.59

Total pages scanned                             23639818         8212727         5015497
Percentage pages scanned/written                  13.45%           7.41%           4.66%
User/Sys Time Running Test (seconds)             2806.01         2744.46         2789.54
Percentage Time Spent Direct Reclaim              73.81%          33.06%          13.33%
Total Elapsed Time (seconds)                    10274.33         3705.47         2812.54
Percentage Time kswapd Awake                      41.50%          27.78%          16.66%

Again, the test completes far faster with the full series and fewer dirty
pages are encountered. File writebacks from kswapd are reduced to negligible
levels.

PPC64                                       traceonly-v6     nodirect-v6 flushforward-v6
Direct reclaims                                      580             529             648
Direct reclaim pages scanned                      111382           92480          106061
Direct reclaim write file async I/O                  673               0               0
Direct reclaim write anon async I/O                23361           14769           15701
Direct reclaim write file sync I/O                   300               0               0
Direct reclaim write anon sync I/O                 12224           10106            1803
Wake kswapd requests                                 302             276             305
Kswapd wakeups                                       220             206             140
Kswapd pages scanned                            10071156         7110936         3622584
Kswapd reclaim write file async I/O               261563           59626            6818
Kswapd reclaim write anon async I/O              2230514          689606          422745
Kswapd reclaim write file sync I/O                     0               0               0
Kswapd reclaim write anon sync I/O                     0               0               0
Time stalled direct reclaim (seconds)            5366.14         1668.51          974.11
Time kswapd awake (seconds)                      5094.97         1621.02         1030.18

Total pages scanned                             10182538         7203416         3728645
Percentage pages scanned/written                  24.83%          10.75%          11.99%
User/Sys Time Running Test (seconds)             3398.37         2615.25         2234.56
Percentage Time Spent Direct Reclaim              61.23%          38.95%          30.36%
Total Elapsed Time (seconds)                     6990.13         3174.43         2459.29
Percentage Time kswapd Awake                      72.89%          51.06%          41.89%

Again, far faster completion times with a significant reduction in the
number of dirty pages encountered.

Overall the full series eliminates calling into the filesystem from direct
reclaim while massively reducing the number of dirty file pages encountered
by page reclaim. There was a concern that no file writeback from page
reclaim would cause problems and it still might, but preliminary data show
that the number of dirty pages encountered is so small that it's not likely
to be a problem.

There is ongoing work in writeback that should help further reduce the
number of dirty pages encountered but the two series complement rather than
collide with each other so there is no merge dependency.

Any objections to merging?

Mel Gorman (6):
  vmscan: tracing: Roll up of patches currently in mmotm
  vmscan: tracing: Update trace event to track if page reclaim IO is for
    anon or file pages
  vmscan: tracing: Update post-processing script to distinguish between
    anon and file IO from page reclaim
  vmscan: tracing: Correct units in post-processing script
  vmscan: Do not writeback filesystem pages in direct reclaim
  vmscan: Kick flusher threads to clean pages when reclaim is encountering
    dirty pages

 .../trace/postprocess/trace-vmscan-postprocess.pl |  686 ++++++++++++++++++++
 include/linux/memcontrol.h                        |    5 -
 include/linux/mmzone.h                            |   15 -
 include/trace/events/gfpflags.h                   |   37 +
 include/trace/events/kmem.h                       |   38 +-
 include/trace/events/vmscan.h                     |  202 ++++++
 mm/memcontrol.c                                   |   31 -
 mm/page_alloc.c                                   |    2 -
 mm/vmscan.c                                       |  481 ++++++++------
 mm/vmstat.c                                       |    2 -
 10 files changed, 1205 insertions(+), 294 deletions(-)
 create mode 100644 Documentation/trace/postprocess/trace-vmscan-postprocess.pl
 create mode 100644 include/trace/events/gfpflags.h
 create mode 100644 include/trace/events/vmscan.h