Re: [PATCH 1/2] writeback: Improve busyloop prevention

On Tue 18-10-11 22:35:04, Wu Fengguang wrote:
> On Tue, Oct 18, 2011 at 08:51:28AM +0800, Jan Kara wrote:
> > On Sat 15-10-11 00:28:07, Wu Fengguang wrote:
> > > On Sat, Oct 15, 2011 at 12:00:47AM +0800, Wu Fengguang wrote:
> > > > On Fri, Oct 14, 2011 at 04:18:35AM +0800, Jan Kara wrote:
> > > > > On Thu 13-10-11 22:39:39, Wu Fengguang wrote:
> > > > > > > > +	long pause = 1;
> > > > > > > > +	long max_pause = dirty_writeback_interval ?
> > > > > > > > +			   msecs_to_jiffies(dirty_writeback_interval * 10) :
> > > > > > > > +			   HZ;
> > > > > > > 
> > > > > > > It's better not to put the flusher to sleep for more than 10ms, so that
> > > > > > > when the condition changes, we don't risk making the storage idle for
> > > > > > > too long.
> > > > > > 
> > > > > > Yeah, the one big regression case
> > > > > > 
> > > > > >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue6+  
> > > > > > ------------------------  ------------------------  
> > > > > >                    47.07       -15.5%        39.78  thresh=1M/xfs-2dd-4k-8p-4096M-1M:10-X
> > > > > > 
> > > > > > is exactly caused by the large sleep: the attached graphs are showing
> > > > > > one period of no-progress on the number of written pages.
> > > > >   Thanks for the tests! Interesting. Do you have a trace file from that
> > > > > run? I see the writeback stalled for 20s or so, which is more than
> > > > > dirty_writeback_centisecs, so I think something more complicated must have
> > > > > happened.
> > > > 
> > > > I noticed that
> > > > 
> > > > 1) the global dirty limit is exceeded (dirty=286, limit=256), hence
> > > >    the dd tasks are hard blocked in balance_dirty_pages().
> > > > 
> > > >        flush-8:0-1170  [004]   211.068427: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > > > 
> > > > 2) the flusher thread is not woken up because we test writeback_in_progress()
> > > >    in balance_dirty_pages().
> > > > 
> > > >                 if (unlikely(!writeback_in_progress(bdi)))
> > > >                         bdi_start_background_writeback(bdi);
> > > > 
> > > > Thus the flusher thread waits and waits, as in the trace below.
> > > > 
> > > >        flush-8:0-1170  [004]   211.068427: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > > >        flush-8:0-1170  [004]   211.068428: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > > >        flush-8:0-1170  [004]   211.068428: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > >        flush-8:0-1170  [004]   211.068440: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=9 index=0 to_write=1024 wrote=0
> > > >        flush-8:0-1170  [004]   211.068442: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > >        flush-8:0-1170  [004]   211.068443: writeback_wait: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > > 
> > > >        flush-8:0-1170  [004]   213.110122: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > > >        flush-8:0-1170  [004]   213.110126: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > > >        flush-8:0-1170  [004]   213.110126: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > >        flush-8:0-1170  [004]   213.110134: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=11 index=0 to_write=1024 wrote=0
> > > >        flush-8:0-1170  [004]   213.110135: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > >        flush-8:0-1170  [004]   213.110135: writeback_wait: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > > 
> > > >        flush-8:0-1170  [004]   217.193470: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > > >        flush-8:0-1170  [004]   217.193471: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > > >        flush-8:0-1170  [004]   217.193471: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > >        flush-8:0-1170  [004]   217.193483: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=15 index=0 to_write=1024 wrote=0
> > > >        flush-8:0-1170  [004]   217.193485: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > 
> > > It's still puzzling why dirty pages remain at 286 and do not get
> > > cleaned by either of the flusher threads (for local XFS and NFSROOT)
> > > for such a long time..
> >   I was looking at this as well. So the reason why pages were not cleaned
> > by the flusher thread is that there were 2 dirty inodes and the inode with
> > dirty pages had dirtied_when newer than the time when we started this
> > background writeback. Thus the running background writeback work always
> > included only the other inode which has no dirty pages but I_DIRTY_SYNC set.
> 
> Yes, that's very likely, given that the background work can run for a
> very long time and there are still code paths that redirty_tail() the inode.
> 
> This sounds horrible -- as time goes by, more and more inodes could be
> excluded from the background writeback due to inode->dirtied_when
> being touched undesirably.
  Yes, but it's not only about touching inode->dirtied_when. It can also
be the case that the inode started to be dirty only after we started the
writeback. For example in your run, I'm pretty confident from the traces
that that's what happened. This behavior has been possible for a long time
(I think I introduced it in commit 7624ee72aa09334af072853457a5d46d9901c3f8)
and is a result of how we do livelock avoidance using older_than_this (or
wb_start previously). When you don't have a long-running work that cannot
make progress, it stays mostly hidden (observable only as a small
inefficiency of background writeback).

> > Apparently XFS is stubborn and refuses to write the inode although we try
> > rather hard. That is probably because the dd task writing to this inode is
> > stuck in balance_dirty_pages() while holding the ilock - which is a bit
> > unfortunate behavior, but what can we do...
> 
> Yeah, and there may well be other known or unknown long-lived blocking
> cases. What we can do in the VFS is to not be so picky about these
> conditions...
  True.

> To be frank I still like the requeue_io_wait() approach. It can
> trivially replace _all_ redirty_tail() calls and hence avoid touching
> inode->dirtied_when -- keeping that time stamp sane is the one thing I
> would really like to do.
  As I pointed out above, it's not only about redirty_tail(), so I believe
we should solve this problem regardless of whether we use requeue_io_wait()
or some other busyloop prevention - do you agree with the attached patch?
If yes, please add it to your patch queue for the next merge window.

  BTW: I think you cannot avoid _all_ redirty_tail() calls - in situations
where the inode was really redirtied, you need to redirty_tail() it. But I
agree you can remove all the redirty_tail() calls which are there because
we failed to make any progress with the inode.

> It also narrows down the perhaps blocked inodes and hence the need to
> do heuristic wait-and-retry to the minimal.
> 
> It does introduce one more list, which IMHO is more tolerable than the
> problems it fixes. And its max delay time can be reduced explicitly
> if necessary: when there are heavy dirtiers, it will quickly be woken
> up by balance_dirty_pages(); in other cases we can let the kupdate
> work check b_more_io_wait before aborting to ensure a 5s max delay:
> 
> @@ -863,7 +863,7 @@ static long wb_check_old_data_flush(stru
>  
>         expired = wb->last_old_flush +
>                         msecs_to_jiffies(dirty_writeback_interval * 10);
> -       if (time_before(jiffies, expired))
> +       if (time_before(jiffies, expired) && list_empty(&wb->b_more_io_wait))
>                 return 0;
>  
>         wb->last_old_flush = jiffies;
  So after seeing the complications with my code, I was also reconsidering
your approach. Having an additional list should be OK if we document the
logic of all the lists in one place in detail. So the remaining problem I
have is the uncertainty about when writeback will be retried. Your change
above guarantees one immediate retry in case we were not doing for_kupdate
writeback when the filesystem refused to write the inode, and the next
retry only after dirty_writeback_interval, which seems too late. But if we
trigger the kupdate work earlier to retry blocked inodes, your method would
be a viable alternative - something like the (completely untested) second
attached patch? With that I find both approaches mostly equivalent, so if
yours passes testing and you like it more, then I'm fine with that.
 
> >   I think the patch you suggest in the other email does not fix the above
> > scenario (although it is useful for reducing latency so I'll include it -
> > thanks for it!).  Probably you were just lucky enough not to hit it in your
> > next run. What I'd suggest is to refresh oldest_jif in wb_writeback() when
> > we do not make any progress with writeback. Thus we allow freshly dirtied
> > inodes to be queued when we cannot make progress with the current set of
> > inodes. The resulting patch is attached.
> 
> Refreshing oldest_jif should be safe for the background work. However
> there is a risk of livelock for other works, as the oldest_jif test is
> there precisely to prevent livelocks..
  Yes, I know. I just figured that refreshing oldest_jif only when we could
not make progress (and thus we know there is no other work to do, because
we would break out of the loop in that case) is safe. But it is subtle, and
I've since realized that refreshing oldest_jif for background writeback
alone is enough, so let's just do that.

> For one thing, it will break this code:
> 
>                 /*
>                  * Sync livelock prevention. Each inode is tagged and synced in
>                  * one shot. If still dirty, it will be redirty_tail()'ed below.
>                  * Update the dirty time to prevent enqueue and sync it again.
>                  */
>                 if ((inode->i_state & I_DIRTY) &&
>                     (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
>                         inode->dirtied_when = jiffies;
  Hum, right, but for WB_SYNC_ALL writeback the filesystem had better not
refuse to write back any inode; other things would break horribly. And it
certainly is not an issue when we refresh the timestamp only for background
writeback.

> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 04cf3b9..6e909a9 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -699,8 +699,11 @@ static long wb_writeback(struct bdi_writeback *wb,
> >  	unsigned long wb_start = jiffies;
> >  	long nr_pages = work->nr_pages;
> >  	unsigned long oldest_jif;
> > -	struct inode *inode;
> >  	long progress;
> > +	long pause = 1;
> > +	long max_pause = dirty_writeback_interval ?
> > +			   msecs_to_jiffies(dirty_writeback_interval * 10) :
> > +			   HZ;
> 
> There seem to be no strong reasons to prefer dirty_writeback_interval over
> HZ. So how about using the simpler form "max_pause = HZ"?
  Well, dirty_writeback_interval actually makes some sense here. It is the
interval at which the user wants the flusher thread to recheck the
dirtiness situation, so it defines the latency the user wants from the
flusher thread. Thus after this time we definitely want to break out of any
work which cannot progress. If dirty_writeback_interval is shorter than 1s,
we would break this by using HZ in the above code. Also, I wanted to avoid
introducing another magic value into the writeback code when there is a
tunable which makes sense and can be used... I can add a comment explaining
this if you want.

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
From 7b559f1cea41cdba7b39138ad1637f8000e218b9 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@xxxxxxx>
Date: Wed, 19 Oct 2011 11:44:41 +0200
Subject: [PATCH] writeback: Include all dirty inodes in background writeback

The current livelock avoidance code makes background work include only inodes
that were dirtied before the background writeback started. However, background
writeback can run for a long time, and thus excluding newly dirtied inodes can
eventually exclude a significant portion of the dirty inodes, making background
writeback inefficient. Since background writeback avoids livelocking the
flusher thread by yielding to any other work, there is no real reason why
background work should not include all dirty inodes, so change the logic in
wb_writeback().

Signed-off-by: Jan Kara <jack@xxxxxxx>
---
 fs/fs-writeback.c |   10 ++++++++--
 1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 04cf3b9..8314241 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -730,11 +730,17 @@ static long wb_writeback(struct bdi_writeback *wb,
 		if (work->for_background && !over_bground_thresh())
 			break;
 
+		/*
+		 * Kupdate and background works are special and we want to
+		 * include all inodes that need writing. Livelock avoidance is
+		 * handled by these works yielding to any other work so we are
+		 * safe.
+		 */
 		if (work->for_kupdate) {
 			oldest_jif = jiffies -
 				msecs_to_jiffies(dirty_expire_interval * 10);
-			work->older_than_this = &oldest_jif;
-		}
+		} else if (work->for_background)
+			oldest_jif = jiffies;
 
 		trace_writeback_start(wb->bdi, work);
 		if (list_empty(&wb->b_io))
-- 
1.7.1

From 595677f8efcaa0d9f675bf74a7048739323afd06 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@xxxxxxx>
Date: Wed, 19 Oct 2011 13:44:46 +0200
Subject: [PATCH] writeback: Retry kupdate work early if we need to retry some inode writeback

In case we could not do any writeback for some inodes, trigger the next
kupdate work early so that writeback of these inodes is not delayed for the
whole dirty_writeback_interval.

Signed-off-by: Jan Kara <jack@xxxxxxx>
---
 fs/fs-writeback.c |   25 ++++++++++++++++++-------
 1 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 8314241..e48da04 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -701,6 +701,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 	unsigned long oldest_jif;
 	struct inode *inode;
 	long progress;
+	long total_progress = 0;
 
 	oldest_jif = jiffies;
 	work->older_than_this = &oldest_jif;
@@ -750,6 +751,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 		else
 			progress = __writeback_inodes_wb(wb, work);
 		trace_writeback_written(wb->bdi, work);
+		total_progress += progress;
 
 		wb_update_bandwidth(wb, wb_start);
 
@@ -783,7 +785,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 	}
 	spin_unlock(&wb->list_lock);
 
-	return nr_pages - work->nr_pages;
+	return total_progress;
 }
 
 /*
@@ -845,7 +847,7 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
 
 	expired = wb->last_old_flush +
 			msecs_to_jiffies(dirty_writeback_interval * 10);
-	if (time_before(jiffies, expired))
+	if (time_before(jiffies, expired) && list_empty(&wb->b_more_io_wait))
 		return 0;
 
 	wb->last_old_flush = jiffies;
@@ -915,7 +917,11 @@ int bdi_writeback_thread(void *data)
 {
 	struct bdi_writeback *wb = data;
 	struct backing_dev_info *bdi = wb->bdi;
-	long pages_written;
+	long progress;
+	unsigned int pause = 1;
+	unsigned int max_pause = dirty_writeback_interval ?
+			msecs_to_jiffies(dirty_writeback_interval * 10) :
+			HZ;
 
 	current->flags |= PF_SWAPWRITE;
 	set_freezable();
@@ -935,12 +941,14 @@ int bdi_writeback_thread(void *data)
 		 */
 		del_timer(&wb->wakeup_timer);
 
-		pages_written = wb_do_writeback(wb, 0);
+		progress = wb_do_writeback(wb, 0);
 
-		trace_writeback_pages_written(pages_written);
+		trace_writeback_pages_written(progress);
 
-		if (pages_written)
+		if (progress) {
 			wb->last_active = jiffies;
+			pause = 1;
+		}
 
 		set_current_state(TASK_INTERRUPTIBLE);
 		if (!list_empty(&bdi->work_list) || kthread_should_stop()) {
@@ -948,8 +956,11 @@ int bdi_writeback_thread(void *data)
 			continue;
 		}
 
-		if (wb_has_dirty_io(wb) && dirty_writeback_interval)
-			schedule_timeout(msecs_to_jiffies(dirty_writeback_interval * 10));
+		if (!list_empty(&wb->b_more_io_wait) && pause < max_pause) {
+			schedule_timeout(pause);
+			pause <<= 1;
+		} else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
+			schedule_timeout(max_pause);
 		else {
 			/*
 			 * We have nothing to do, so can go sleep without any
-- 
1.7.1

