On Wed, Apr 20, 2011 at 09:21:20AM +0800, Dave Chinner wrote: > On Tue, Apr 19, 2011 at 08:56:16PM +0800, Wu Fengguang wrote: > > On Tue, Apr 19, 2011 at 05:57:40PM +0800, Jan Kara wrote: > > > On Tue 19-04-11 17:35:23, Dave Chinner wrote: > > > > On Tue, Apr 19, 2011 at 11:00:06AM +0800, Wu Fengguang wrote: > > > > > A background flush work may run for ever. So it's reasonable for it to > > > > > mimic the kupdate behavior of syncing old/expired inodes first. > > > > > > > > > > The policy is > > > > > - enqueue all newly expired inodes at each queue_io() time > > > > > - enqueue all dirty inodes if there are no more expired inodes to sync > > > > > > > > > > This will help reduce the number of dirty pages encountered by page > > > > > reclaim, eg. the pageout() calls. Normally older inodes contain older > > > > > dirty pages, which are more close to the end of the LRU lists. So > > > > > syncing older inodes first helps reducing the dirty pages reached by > > > > > the page reclaim code. > > > > > > > > Once again I think this is the wrong place to be changing writeback > > > > policy decisions. for_background writeback only goes through > > > > wb_writeback() and writeback_inodes_wb() (same as for_kupdate > > > > writeback), so a decision to change from expired inodes to fresh > > > > inodes, IMO, should be made in wb_writeback. > > > > > > > > That is, for_background and for_kupdate writeback start with the > > > > same policy (older_than_this set) to writeback expired inodes first, > > > > then when background writeback runs out of expired inodes, it should > > > > switch to all remaining inodes by clearing older_than_this instead > > > > of refreshing it for the next loop. > > > Yes, I agree with this and my impression is that Fengguang is trying to > > > achieve exactly this behavior. > > > > > > > This keeps all the policy decisions in the one place, all using the > > > > same (existing) mechanism, and all relatively simple to understand, > > > > and easy to tracepoint for debugging. Changing writeback policy > > > > deep in the writeback stack is not a good idea as it will make > > > > extending writeback policies in future (e.g. for cgroup awareness) > > > > very messy. > > > Hmm, I see. I agree the policy decisions should be at one place if > > > reasonably possible. Fengguang moves them from wb_writeback() to inode > > > queueing code which looks like a logical place to me as well - there we > > > have the largest control over what inodes do we decide to write and don't > > > have to pass all the detailed 'instructions' down in wbc structure. So if > > > we later want to add cgroup awareness to writeback, I imagine we just add > > > the knowledge to inode queueing code. > > > > I actually started with wb_writeback() as a natural choice, and then > > found it much easier to do the expired-only=>all-inodes switching in > > move_expired_inodes() since it needs to know the @b_dirty and @tmp > > lists' emptiness to trigger the switch. It's not sane for > > wb_writeback() to look into such details. And once you do the switch > > part in move_expired_inodes(), the whole policy naturally follows. > > Well, not really. You didn't need to modify move_expired_inodes() at > all to implement these changes - all you needed to do was modify how > older_than_this is configured. > > writeback policy is defined by the struct writeback_control. > move_expired_inodes() is pure mechanism. What you've done is remove > policy from the struct wbc and moved it to move_expired_inodes(), > which now defines both policy and mechanism. > Furhter, this means that all the tracing that uses the struct wbc no > no longer shows the entire writeback policy that is being worked on, > so we lose visibility into policy decisions that writeback is > making. Good point! I'm convinced, visibility is a necessity for debugging the complex writeback behaviors. > This same change is as simple as updating wbc->older_than_this > appropriately after the wb_writeback() call for both background and > kupdate and leaving the lower layers untouched. It's just a policy > change. If you thinkthe mechanism is inefficient, copy > wbc->older_than_this to a local variable inside > move_expired_inodes().... Do you like something like this? (details will change a bit when rearranging the patchset) --- linux-next.orig/fs/fs-writeback.c 2011-04-20 10:30:47.000000000 +0800 +++ linux-next/fs/fs-writeback.c 2011-04-20 10:40:19.000000000 +0800 @@ -660,11 +660,6 @@ static long wb_writeback(struct bdi_writ long write_chunk; struct inode *inode; - if (wbc.for_kupdate) { - wbc.older_than_this = &oldest_jif; - oldest_jif = jiffies - - msecs_to_jiffies(dirty_expire_interval * 10); - } if (!wbc.range_cyclic) { wbc.range_start = 0; wbc.range_end = LLONG_MAX; @@ -713,10 +708,17 @@ static long wb_writeback(struct bdi_writ if (work->for_background && !over_bground_thresh()) break; + if (work->for_kupdate || work->for_background) { + oldest_jif = jiffies - + msecs_to_jiffies(dirty_expire_interval * 10); + wbc.older_than_this = &oldest_jif; + } + wbc.more_io = 0; wbc.nr_to_write = write_chunk; wbc.pages_skipped = 0; +retry_all: trace_wbc_writeback_start(&wbc, wb->bdi); if (work->sb) __writeback_inodes_sb(work->sb, wb, &wbc); @@ -733,6 +735,17 @@ static long wb_writeback(struct bdi_writ if (wbc.nr_to_write <= 0) continue; /* + * No expired inode? Try all fresh ones + */ + if ((work->for_kupdate || work->for_background) && + wbc.older_than_this && + wbc.nr_to_write == write_chunk && + list_empty(&wb->b_io) && + list_empty(&wb->b_more_io)) { + wbc.older_than_this = NULL; + goto retry_all; + } + /* * Didn't write everything and we don't have more IO, bail */ if (!wbc.more_io) Thanks, Fengguang -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html