On Sat 15-10-11 00:28:07, Wu Fengguang wrote:
> On Sat, Oct 15, 2011 at 12:00:47AM +0800, Wu Fengguang wrote:
> > On Fri, Oct 14, 2011 at 04:18:35AM +0800, Jan Kara wrote:
> > > On Thu 13-10-11 22:39:39, Wu Fengguang wrote:
> > > > > > +	long pause = 1;
> > > > > > +	long max_pause = dirty_writeback_interval ?
> > > > > > +			 msecs_to_jiffies(dirty_writeback_interval * 10) :
> > > > > > +			 HZ;
> > > > >
> > > > > It's better not to put the flusher to sleep for more than 10ms, so
> > > > > that when the condition changes, we don't risk making the storage
> > > > > idle for too long.
> > > >
> > > > Yeah, the one big regression case
> > > >
> > > >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue6+
> > > > ------------------------  ------------------------
> > > >                    47.07       -15.5%       39.78  thresh=1M/xfs-2dd-4k-8p-4096M-1M:10-X
> > > >
> > > > is exactly caused by the large sleep: the attached graphs show one
> > > > period of no progress in the number of written pages.
> > > Thanks for the tests! Interesting. Do you have a trace file from that
> > > run? I see the writeback stalled for 20s or so, which is more than
> > > dirty_writeback_centisecs, so I think something more complicated must
> > > have happened.
> >
> > I noticed that
> >
> > 1) the global dirty limit is exceeded (dirty=286, limit=256), hence
> >    the dd tasks are hard blocked in balance_dirty_pages().
> >
> >    flush-8:0-1170 [004] 211.068427: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> >
> > 2) the flusher thread is not woken up because we test writeback_in_progress()
> >    in balance_dirty_pages().
> >
> >    if (unlikely(!writeback_in_progress(bdi)))
> >            bdi_start_background_writeback(bdi);
> >
> > Thus the flusher thread waits and waits, as in the trace below.
> >
> > flush-8:0-1170 [004] 211.068427: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > flush-8:0-1170 [004] 211.068428: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > flush-8:0-1170 [004] 211.068428: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > flush-8:0-1170 [004] 211.068440: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=9 index=0 to_write=1024 wrote=0
> > flush-8:0-1170 [004] 211.068442: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > flush-8:0-1170 [004] 211.068443: writeback_wait: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> >
> > flush-8:0-1170 [004] 213.110122: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > flush-8:0-1170 [004] 213.110126: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > flush-8:0-1170 [004] 213.110126: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > flush-8:0-1170 [004] 213.110134: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=11 index=0 to_write=1024 wrote=0
> > flush-8:0-1170 [004] 213.110135: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > flush-8:0-1170 [004] 213.110135: writeback_wait: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> >
> > flush-8:0-1170 [004] 217.193470: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > flush-8:0-1170 [004] 217.193471: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > flush-8:0-1170 [004] 217.193471: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > flush-8:0-1170 [004] 217.193483: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=15 index=0 to_write=1024 wrote=0
> > flush-8:0-1170 [004] 217.193485: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
>
> It's still puzzling why dirty pages remain at 286 and do not get
> cleaned by either of the flusher threads (local XFS and NFSROOT) for
> so long..

I was looking at this as well. So the reason why pages were not cleaned
by the flusher thread is that there were 2 dirty inodes, and the inode
with dirty pages had dirtied_when newer than the time when we started
this background writeback. Thus the running background writeback work
always included only the other inode, which had no dirty pages but had
I_DIRTY_SYNC set. Apparently XFS is stubborn and refuses to write that
inode although we try rather hard. That is probably because the dd
writing to it is stuck in balance_dirty_pages() while holding the ilock
- which is a bit unfortunate behavior, but what can we do...
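To see why the freshly dirtied inode keeps being excluded, it helps to
look at where the cut-off happens: the expired-inode scan in
fs/fs-writeback.c. The snippet below is a simplified sketch of the
3.1-era move_expired_inodes() logic (per-superblock grouping and the
return value are omitted), an illustration rather than the verbatim
kernel code:

/*
 * Sketch: b_dirty is kept oldest-first, so the scan stops at the
 * first inode whose dirtied_when is after *older_than_this.  An
 * inode redirtied after the background work sampled its timestamp
 * is therefore never moved to b_io and never written.
 */
static void move_expired_inodes(struct list_head *delaying_queue,
				struct list_head *dispatch_queue,
				unsigned long *older_than_this)
{
	struct inode *inode;

	while (!list_empty(delaying_queue)) {
		inode = wb_inode(delaying_queue->prev);
		if (older_than_this &&
		    inode_dirtied_after(inode, *older_than_this))
			break;	/* too fresh - and so is everything newer */
		list_move(&inode->i_wb_list, dispatch_queue);
	}
}

Since the list is ordered oldest-first, an inode redirtied after the
work's timestamp was taken stays parked on b_dirty for as long as the
work keeps running - exactly the situation in the trace above.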
I think the patch you suggest in the other email does not fix the above
scenario (although it is useful for reducing latency, so I'll include it
- thanks for it!). Probably you were just lucky enough not to hit it in
your next run.

What I'd suggest instead is to refresh oldest_jif in wb_writeback() when
we do not make any progress with writeback. That way freshly dirtied
inodes are allowed to be queued when we cannot make progress with the
current set of inodes. The resulting patch is attached.

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
>From b85b7cdaf5fafec2d850304c1d0b1813c1c122a3 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@xxxxxxx>
Date: Thu, 8 Sep 2011 01:05:25 +0200
Subject: [PATCH 1/2] writeback: Improve busyloop prevention

Writeback of an inode can be stalled by things like internal fs locks being
held. So in case we didn't write anything during a pass through the b_io
list, just wait for a moment and try again. Also allow newly dirtied inodes
to be queued during the retry. When retrying is fruitless for a long time,
or when we have some other work to do, we just stop the current work to
avoid blocking the flusher thread.

CC: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Reviewed-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
Signed-off-by: Jan Kara <jack@xxxxxxx>
---
 fs/fs-writeback.c |   61 +++++++++++++++++++++++++++++++++++++++++----------
 1 files changed, 48 insertions(+), 13 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 04cf3b9..6e909a9 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -699,8 +699,11 @@ static long wb_writeback(struct bdi_writeback *wb,
 	unsigned long wb_start = jiffies;
 	long nr_pages = work->nr_pages;
 	unsigned long oldest_jif;
-	struct inode *inode;
 	long progress;
+	long pause = 1;
+	long max_pause = dirty_writeback_interval ?
+			 msecs_to_jiffies(dirty_writeback_interval * 10) :
+			 HZ;
 
 	oldest_jif = jiffies;
 	work->older_than_this = &oldest_jif;
@@ -730,11 +733,19 @@ static long wb_writeback(struct bdi_writeback *wb,
 		if (work->for_background && !over_bground_thresh())
 			break;
 
+		/*
+		 * Refresh oldest timestamp. In kupdate style we include
+		 * freshly expired inodes in each round. For other types of
+		 * writeback we don't want to include newly dirtied inodes
+		 * to avoid livelocking. But in case we made no progress in
+		 * the last writeback round, allow new inodes to be queued
+		 * so that we don't block flusher thread unnecessarily.
+		 */
 		if (work->for_kupdate) {
 			oldest_jif = jiffies -
 				msecs_to_jiffies(dirty_expire_interval * 10);
-			work->older_than_this = &oldest_jif;
-		}
+		} else if (pause > 1)
+			oldest_jif = jiffies;
 
 		trace_writeback_start(wb->bdi, work);
 		if (list_empty(&wb->b_io))
@@ -755,25 +766,49 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 * mean the overall work is done. So we keep looping as long
 		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (progress)
+		if (progress) {
+			pause = 1;
 			continue;
+		}
 		/*
 		 * No more inodes for IO, bail
 		 */
 		if (list_empty(&wb->b_more_io))
 			break;
 		/*
-		 * Nothing written. Wait for some inode to
-		 * become available for writeback. Otherwise
-		 * we'll just busyloop.
+		 * Nothing written (some internal fs locks were unavailable or
+		 * inode was under writeback from balance_dirty_pages() or
+		 * similar conditions).
+		 *
+		 * Wait for a while to avoid busylooping unless we waited for
+		 * so long it does not make sense to retry anymore.
 		 */
-		if (!list_empty(&wb->b_more_io)) {
-			trace_writeback_wait(wb->bdi, work);
-			inode = wb_inode(wb->b_more_io.prev);
-			spin_lock(&inode->i_lock);
-			inode_wait_for_writeback(inode, wb);
-			spin_unlock(&inode->i_lock);
+		if (pause > max_pause)
+			break;
+		/*
+		 * Set state here to prevent races with someone waking us up
+		 * (because new work is queued or because background limit is
+		 * exceeded).
+		 */
+		set_current_state(TASK_INTERRUPTIBLE);
+		/* If there's some other work to do, proceed with it... */
+		if (!list_empty(&wb->bdi->work_list) ||
+		    (!work->for_background && over_bground_thresh())) {
+			__set_current_state(TASK_RUNNING);
+			break;
 		}
+		trace_writeback_wait(wb->bdi, work);
+		spin_unlock(&wb->list_lock);
+		/*
+		 * Clear writeback_running so that we properly indicate that
+		 * writeback is currently stalled
+		 */
+		clear_bit(BDI_writeback_running, &wb->bdi->state);
+		schedule_timeout(pause);
+		set_bit(BDI_writeback_running, &wb->bdi->state);
+		if (pause < max_pause)
+			pause <<= 1;
+		spin_lock(&wb->list_lock);
 	}
 	spin_unlock(&wb->list_lock);
 
-- 
1.7.1
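
As a sanity check of the timing the patch implies: the retry is a capped
exponential backoff. Below is a small standalone C sketch of just that
policy - dirty_writeback_interval and HZ are hardcoded to the common
defaults of 500 centisecs and 1000, which is an assumption made for
illustration, not a value taken from the traces above:

#include <stdio.h>

int main(void)
{
	/* assumed defaults: dirty_writeback_interval=500 centisecs, HZ=1000 */
	const long hz = 1000;
	const long dirty_writeback_interval = 500;
	/* mirrors: max_pause = msecs_to_jiffies(dirty_writeback_interval * 10) */
	long max_pause = dirty_writeback_interval * 10 * hz / 1000;
	long pause, total = 0;

	/*
	 * Same pause sequence as wb_writeback() above: sleep, then double
	 * the pause, and give up once the next pause would exceed max_pause.
	 */
	for (pause = 1; pause <= max_pause; pause <<= 1) {
		total += pause;	/* stands in for schedule_timeout(pause) */
		printf("pause=%ld jiffies, slept so far=%ld\n", pause, total);
	}
	printf("giving up: slept %ld jiffies total (max_pause=%ld)\n",
	       total, max_pause);
	return 0;
}

With these numbers the flusher sleeps 1+2+...+4096 = 8191 jiffies (about
8.2s) before bailing out, so the total fruitless wait is bounded by a bit
under 2 * max_pause, after which wb_writeback() finishes the work item and
the normal periodic wakeups take over.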