Re: [PATCH 3/6] writeback: sync expired inodes first in background writeback

Wu Fengguang <fengguang.wu@xxxxxxxxx> · Wed, 20 Apr 2011 10:53:21 +0800

On Wed, Apr 20, 2011 at 09:21:20AM +0800, Dave Chinner wrote:
> On Tue, Apr 19, 2011 at 08:56:16PM +0800, Wu Fengguang wrote:
> > On Tue, Apr 19, 2011 at 05:57:40PM +0800, Jan Kara wrote:
> > > On Tue 19-04-11 17:35:23, Dave Chinner wrote:
> > > > On Tue, Apr 19, 2011 at 11:00:06AM +0800, Wu Fengguang wrote:
> > > > > A background flush work may run for ever. So it's reasonable for it to
> > > > > mimic the kupdate behavior of syncing old/expired inodes first.
> > > > > 
> > > > > The policy is
> > > > > - enqueue all newly expired inodes at each queue_io() time
> > > > > - enqueue all dirty inodes if there are no more expired inodes to sync
> > > > > 
> > > > > This will help reduce the number of dirty pages encountered by page
> > > > > reclaim, eg. the pageout() calls. Normally older inodes contain older
> > > > > dirty pages, which are more close to the end of the LRU lists. So
> > > > > syncing older inodes first helps reducing the dirty pages reached by
> > > > > the page reclaim code.
> > > > 
> > > > Once again I think this is the wrong place to be changing writeback
> > > > policy decisions. for_background writeback only goes through
> > > > wb_writeback() and writeback_inodes_wb() (same as for_kupdate
> > > > writeback), so a decision to change from expired inodes to fresh
> > > > inodes, IMO, should be made in wb_writeback.
> > > > 
> > > > That is, for_background and for_kupdate writeback start with the
> > > > same policy (older_than_this set) to writeback expired inodes first,
> > > > then when background writeback runs out of expired inodes, it should
> > > > switch to all remaining inodes by clearing older_than_this instead
> > > > of refreshing it for the next loop.
> > >   Yes, I agree with this and my impression is that Fengguang is trying to
> > > achieve exactly this behavior.
> > > 
> > > > This keeps all the policy decisions in the one place, all using the
> > > > same (existing) mechanism, and all relatively simple to understand,
> > > > and easy to tracepoint for debugging.  Changing writeback policy
> > > > deep in the writeback stack is not a good idea as it will make
> > > > extending writeback policies in future (e.g. for cgroup awareness)
> > > > very messy.
> > >   Hmm, I see. I agree the policy decisions should be at one place if
> > > reasonably possible. Fengguang moves them from wb_writeback() to inode
> > > queueing code which looks like a logical place to me as well - there we
> > > have the largest control over what inodes do we decide to write and don't
> > > have to pass all the detailed 'instructions' down in wbc structure. So if
> > > we later want to add cgroup awareness to writeback, I imagine we just add
> > > the knowledge to inode queueing code.
> > 
> > I actually started with wb_writeback() as a natural choice, and then
> > found it much easier to do the expired-only=>all-inodes switching in
> > move_expired_inodes() since it needs to know the @b_dirty and @tmp
> > lists' emptiness to trigger the switch. It's not sane for
> > wb_writeback() to look into such details. And once you do the switch
> > part in move_expired_inodes(), the whole policy naturally follows.
> 
> Well, not really. You didn't need to modify move_expired_inodes() at
> all to implement these changes - all you needed to do was modify how
> older_than_this is configured.
> 
> writeback policy is defined by the struct writeback_control.
> move_expired_inodes() is pure mechanism. What you've done is remove
> policy from the struct wbc and moved it to move_expired_inodes(),
> which now defines both policy and mechanism.

> Furhter, this means that all the tracing that uses the struct wbc no
> no longer shows the entire writeback policy that is being worked on,
> so we lose visibility into policy decisions that writeback is
> making.

Good point! I'm convinced, visibility is a necessity for debugging the
complex writeback behaviors.

> This same change is as simple as updating wbc->older_than_this
> appropriately after the wb_writeback() call for both background and
> kupdate and leaving the lower layers untouched. It's just a policy
> change. If you thinkthe mechanism is inefficient, copy
> wbc->older_than_this to a local variable inside
> move_expired_inodes()....

Do you like something like this? (details will change a bit when
rearranging the patchset)

--- linux-next.orig/fs/fs-writeback.c	2011-04-20 10:30:47.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-04-20 10:40:19.000000000 +0800
@@ -660,11 +660,6 @@ static long wb_writeback(struct bdi_writ
 	long write_chunk;
 	struct inode *inode;
 
-	if (wbc.for_kupdate) {
-		wbc.older_than_this = &oldest_jif;
-		oldest_jif = jiffies -
-				msecs_to_jiffies(dirty_expire_interval * 10);
-	}
 	if (!wbc.range_cyclic) {
 		wbc.range_start = 0;
 		wbc.range_end = LLONG_MAX;
@@ -713,10 +708,17 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_background && !over_bground_thresh())
 			break;
 
+		if (work->for_kupdate || work->for_background) {
+			oldest_jif = jiffies -
+				msecs_to_jiffies(dirty_expire_interval * 10);
+			wbc.older_than_this = &oldest_jif;
+		}
+
 		wbc.more_io = 0;
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
 
+retry_all:
 		trace_wbc_writeback_start(&wbc, wb->bdi);
 		if (work->sb)
 			__writeback_inodes_sb(work->sb, wb, &wbc);
@@ -733,6 +735,17 @@ static long wb_writeback(struct bdi_writ
 		if (wbc.nr_to_write <= 0)
 			continue;
 		/*
+		 * No expired inode? Try all fresh ones
+		 */
+		if ((work->for_kupdate || work->for_background) &&
+		    wbc.older_than_this &&
+		    wbc.nr_to_write == write_chunk &&
+		    list_empty(&wb->b_io) &&
+		    list_empty(&wb->b_more_io)) {
+			wbc.older_than_this = NULL;
+			goto retry_all;
+		}
+		/*
 		 * Didn't write everything and we don't have more IO, bail
 		 */
 		if (!wbc.more_io)

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html