I finally got a chance last week to revisit the topic of direct reclaim avoiding writing out pages. As it came up during discussion last time, I also had a stab at making the VM write ranges of pages instead of individual pages. I am not proposing this for merging yet; first I want to see what people think of the general direction and whether we can agree that it is the right one.

To summarise, there are two big problems with page reclaim right now. The first is that page reclaim uses a_ops->writepage to write a page back while holding the page lock, which is inefficient from an IO perspective because of the seeky write patterns it generates. The second is that direct reclaim calling into the filesystem splices two potentially deep call paths together and risks overflowing the stack on complex storage or filesystems.

This series is an early draft at tackling both of these problems and is in three stages. The first 4 patches are a forward-port of tracepoints that are partly based on tracepoints defined by Larry Woodman but never merged. They trace parts of kswapd, direct reclaim, LRU page isolation and page writeback. The tracepoints can be used to evaluate what is happening within reclaim and whether things are getting better or worse; a sketch of what one of these events looks like follows below. They do not have to be part of the final series but might be useful during discussion.

Patch 5 writes out contiguous ranges of pages where possible using a_ops->writepages. When writing a range, the inode is pinned and the page lock released before submitting to writepages(); a rough sketch of this step is also included below. This potentially generates a better IO pattern and it should avoid a lock inversion where the filesystem wants a page lock that the VM already holds. The downside of writing ranges is that the VM may generate more IO than is strictly necessary.

Patch 6 prevents direct reclaim writing out pages at all; instead, dirty pages are put back on the LRU (see the third sketch below). For lumpy reclaim, the caller will briefly wait on dirty pages to be written out before trying to reclaim them a second time. This last patch increases the responsibility of kswapd somewhat because it is now cleaning pages on behalf of direct reclaimers, but kswapd seemed a better fit than the background flusher threads as it knows where the pages needing cleaning are. As the IO is asynchronous, it should not cause kswapd to stall (at least until the queue is congested), but the order in which pages are reclaimed from the LRU is altered: dirty pages that would have been reclaimed by direct reclaimers get another lap on the LRU. The dirty pages could have been put on a dedicated list, but that increased counter overhead and the number of lists, and it is unclear whether it is necessary.

The series has survived performance and stress testing, particularly around high-order allocations, on X86, X86-64 and PPC64. The tests showed that while lumpy reclaim had a slightly lower success rate when allocating huge pages, the rates were still very acceptable; reclaim was a lot less disruptive and allocation latency was lower.

Comments?
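For illustration, the tracepoints added by the first four patches follow the usual TRACE_EVENT pattern. The event and fields below are invented for the example; the real names and fields are defined in include/trace/events/vmscan.h within the series, and the normal trace-header boilerplate (the #undef TRACE_SYSTEM and include-guard dance) is omitted for brevity.

	/*
	 * Invented example event; real definitions live in
	 * include/trace/events/vmscan.h in the series.
	 */
	#include <linux/tracepoint.h>

	TRACE_EVENT(mm_vmscan_example_wake,

		TP_PROTO(int nid, int order),

		TP_ARGS(nid, order),

		TP_STRUCT__entry(
			__field(int, nid)
			__field(int, order)
		),

		TP_fast_assign(
			__entry->nid	= nid;
			__entry->order	= order;
		),

		TP_printk("nid=%d order=%d", __entry->nid, __entry->order)
	);

With a kernel carrying the series booted, the vmscan events can be switched on under /sys/kernel/debug/tracing/events/vmscan/ and the trace-vmscan-postprocess.pl script included in the series summarises the resulting log.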
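The range-writing step in patch 5 boils down to something along these lines. This is a sketch, not the patch itself: write_page_range() is a made-up helper name, while writeback_control, igrab() and iput() are the usual kernel interfaces.

	#include <linux/fs.h>
	#include <linux/mm.h>
	#include <linux/pagemap.h>
	#include <linux/writeback.h>

	static void write_page_range(struct page *page, pgoff_t nr_pages)
	{
		struct address_space *mapping = page_mapping(page);
		struct inode *inode;
		struct writeback_control wbc = {
			.sync_mode	= WB_SYNC_NONE,
			.nr_to_write	= nr_pages,
			.range_start	= (loff_t)page->index << PAGE_SHIFT,
			.range_end	= ((loff_t)(page->index + nr_pages)
						<< PAGE_SHIFT) - 1,
		};

		/* Pin the inode so it cannot go away once the page lock drops */
		inode = igrab(mapping->host);
		if (!inode)
			return;

		/*
		 * Release the page lock before calling into the filesystem so
		 * ->writepages() can take page locks itself without inverting
		 * against the VM.
		 */
		unlock_page(page);

		if (mapping->a_ops->writepages)
			mapping->a_ops->writepages(mapping, &wbc);

		iput(inode);
	}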
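Finally, the policy in patch 6 amounts to a check like the one below when shrink_page_list() finds a dirty page. Again this is an illustrative helper rather than the code in the patch; current_is_kswapd() is the stock kernel helper, and the lumpy_reclaim flag stands in for however the scan control tracks lumpy reclaim.

	/*
	 * Illustrative only: may this reclaim context write out a dirty
	 * page at all? The real check is inline in shrink_page_list()
	 * in mm/vmscan.c.
	 */
	#include <linux/swap.h>		/* current_is_kswapd() */

	static bool reclaim_may_write_dirty(bool lumpy_reclaim)
	{
		/* kswapd now cleans pages on behalf of direct reclaimers */
		if (current_is_kswapd())
			return true;

		/*
		 * Lumpy reclaim waits briefly on writeback and then retries
		 * the dirty pages a second time, so it may still write.
		 */
		if (lumpy_reclaim)
			return true;

		/* Plain direct reclaim: rotate the page back onto the LRU */
		return false;
	}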
 .../trace/postprocess/trace-vmscan-postprocess.pl |  623 ++++++++++++++++++++
 include/trace/events/gfpflags.h                   |   37 ++
 include/trace/events/kmem.h                       |   38 +--
 include/trace/events/vmscan.h                     |  184 ++++++
 mm/vmscan.c                                       |  299 ++++++++--
 5 files changed, 1092 insertions(+), 89 deletions(-)
 create mode 100644 Documentation/trace/postprocess/trace-vmscan-postprocess.pl
 create mode 100644 include/trace/events/gfpflags.h
 create mode 100644 include/trace/events/vmscan.h