On Wed, Jul 6, 2011 at 1:53 PM, Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote: > On Mon, Jul 04, 2011 at 11:25:34AM +0800, Dave Chinner wrote: >> On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote: >> > Christoph, >> > >> > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote: >> > > Johannes, Mel, Wu, >> > > >> > > Dave has been stressing some XFS patches of mine that remove the XFS >> > > internal writeback clustering in favour of using write_cache_pages. >> > > >> > > As part of investigating the behaviour he found out that we're still >> > > doing lots of I/O from the end of the LRU in kswapd. Not only is that >> > > pretty bad behaviour in general, but it also means we really can't >> > > just remove the writeback clustering in writepage given how much >> > > I/O is still done through that. >> > > >> > > Any chance we could the writeback vs kswap behaviour sorted out a bit >> > > better finally? >> > >> > I once tried this approach: >> > >> > http://www.spinics.net/lists/linux-mm/msg09202.html >> > >> > It used a list structure that is not linearly scalable, however that >> > part should be independently improvable when necessary. >> >> I don't think that handing random writeback to the flusher thread is >> much better than doing random writeback directly. Yes, you added >> some clustering, but I'm still don't think writing specific pages is >> the best solution. > > I agree that the VM should avoid writing specific pages as much as > possible. Mostly often, it's indeed OK to just skip sporadically > encountered dirty page and reclaim the clean pages presumably not > far away in the LRU list. So your 2-liner patch is all good if > constraining it to low scan pressure, which will look like > > if (priority == DEF_PRIORITY) > tag PG_reclaim on encountered dirty pages and > skip writing it > > However the VM in general does need the ability to write specific > pages, such as when reclaiming from specific zone/memcg. So I'll still > propose to do bdi_start_inode_writeback(). > > Below is the patch rebased to linux-next. It's good enough for testing > purpose, and I guess even with the ->nr_pages work issue, it's > complete enough to get roughly the same performance as your 2-liner > patch. > >> > The real problem was, it seem to not very effective in my test runs. >> > I found many ->nr_pages works queued before the ->inode works, which >> > effectively makes the flusher working on more dispersed pages rather >> > than focusing on the dirty pages encountered in LRU reclaim. >> >> But that's really just an implementation issue related to how you >> tried to solve the problem. That could be addressed. >> >> However, what I'm questioning is whether we should even care what >> page memory reclaim wants to write - it seems to make fundamentally >> bad decisions from an IO persepctive. >> >> We have to remember that memory reclaim is doing LRU reclaim and the >> flusher threads are doing "oldest first" writeback. IOWs, both are trying >> to operate in the same direction (oldest to youngest) for the same >> purpose. The fundamental problem that occurs when memory reclaim >> starts writing pages back from the LRU is this: >> >> - memory reclaim has run ahead of IO writeback - >> >> The LRU usually looks like this: >> >> oldest youngest >> +---------------+---------------+--------------+ >> clean writeback dirty >> ^ ^ >> | | >> | Where flusher will next work from >> | Where kswapd is working from >> | >> IO submitted by flusher, waiting on completion >> >> >> If memory reclaim is hitting dirty pages on the LRU, it means it has >> got ahead of writeback without being throttled - it's passed over >> all the pages currently under writeback and is trying to write back >> pages that are *newer* than what writeback is working on. IOWs, it >> starts trying to do the job of the flusher threads, and it does that >> very badly. >> >> The $100 question is ∗why is it getting ahead of writeback*? > > The most important case is: faster reader + relatively slow writer. > > Assume for every 10 pages read, 1 page is dirtied, and the dirty speed > is fast enough to trigger the 20% dirty ratio and hence dirty balancing. > > That pattern is able to evenly distribute dirty pages all over the LRU > list and hence trigger lots of pageout()s. The "skip reclaim writes on > low pressure" approach can fix this case. > > Thanks, > Fengguang > --- > Subject: writeback: introduce bdi_start_inode_writeback() > Date: Thu Jul 29 14:41:19 CST 2010 > > This relays ASYNC file writeback IOs to the flusher threads. > > pageout() will continue to serve the SYNC file page writes for necessary > throttling for preventing OOM, which may happen if the LRU list is small > and/or the storage is slow, so that the flusher cannot clean enough > pages before the LRU is full scanned. > > Only ASYNC pageout() is relayed to the flusher threads, the less > frequent SYNC pageout()s will work as before as a last resort. > This helps to avoid OOM when the LRU list is small and/or the storage is > slow, and the flusher cannot clean enough pages before the LRU is > full scanned. > > The flusher will piggy back more dirty pages for IO > - it's more IO efficient > - it helps clean more pages, a good number of them may sit in the same > LRU list that is being scanned. > > To avoid memory allocations at page reclaim, a mempool is created. > > Background/periodic works will quit automatically (as done in another > patch), so as to clean the pages under reclaim ASAP. However for now the > sync work can still block us for long time. > > Jan Kara: limit the search scope. > > CC: Jan Kara <jack@xxxxxxx> > CC: Rik van Riel <riel@xxxxxxxxxx> > CC: Mel Gorman <mel@xxxxxxxxxxxxxxxxxx> > CC: Minchan Kim <minchan.kim@xxxxxxxxx> > Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx> It seems to be enhanced version of old Mel's done. I support this approach :) but I have some questions. > --- > fs/fs-writeback.c | 156 ++++++++++++++++++++++++++++- > include/linux/backing-dev.h | 1 > include/trace/events/writeback.h | 15 ++ > mm/vmscan.c | 8 + > 4 files changed, 174 insertions(+), 6 deletions(-) > > --- linux-next.orig/mm/vmscan.c 2011-06-29 20:43:10.000000000 -0700 > +++ linux-next/mm/vmscan.c 2011-07-05 18:30:19.000000000 -0700 > @@ -825,6 +825,14 @@ static unsigned long shrink_page_list(st > if (PageDirty(page)) { > nr_dirty++; > > + if (page_is_file_cache(page) && mapping && > + sc->reclaim_mode != RECLAIM_MODE_SYNC) { > + if (flush_inode_page(page, mapping) >= 0) { > + SetPageReclaim(page); > + goto keep_locked; keep_locked changes old behavior. Normally, in case of async mode, we does keep_lumpy(ie, we didn't reset reclaim_mode) but now you are always resetting reclaim_mode. so sync call of shrink_page_list never happen if flush_inode_page is successful. Is it your intention? > + } > + } > + If flush_inode_page fails(ie, the page isn't nearby of current work's writeback range), we still do pageout although it's async mode. Is it your intention? -- Kind regards, Minchan Kim _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs