Re: [PATCH 1/2] writeback: Improve busyloop prevention

On Tue, Oct 18, 2011 at 08:51:28AM +0800, Jan Kara wrote:
> On Sat 15-10-11 00:28:07, Wu Fengguang wrote:
> > On Sat, Oct 15, 2011 at 12:00:47AM +0800, Wu Fengguang wrote:
> > > On Fri, Oct 14, 2011 at 04:18:35AM +0800, Jan Kara wrote:
> > > > On Thu 13-10-11 22:39:39, Wu Fengguang wrote:
> > > > > > > +	long pause = 1;
> > > > > > > +	long max_pause = dirty_writeback_interval ?
> > > > > > > +			   msecs_to_jiffies(dirty_writeback_interval * 10) :
> > > > > > > +			   HZ;
> > > > > > 
> > > > > > It's better not to put the flusher to sleep for more than 10ms, so
> > > > > > that when the condition changes, we don't risk leaving the storage
> > > > > > idle for too long.
> > > > > 
> > > > > Yeah, the one big regression case
> > > > > 
> > > > >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue6+  
> > > > > ------------------------  ------------------------  
> > > > >                    47.07       -15.5%        39.78  thresh=1M/xfs-2dd-4k-8p-4096M-1M:10-X
> > > > > 
> > > > > is exactly caused by the large sleep: the attached graphs show one
> > > > > period of no progress in the number of written pages.
> > > >   Thanks for the tests! Interesting. Do you have a trace file from that
> > > > run? I see the writeback stalled for 20s or so, which is more than
> > > > dirty_writeback_centisecs, so I think something more complicated must
> > > > have happened.
> > > 
> > > I noticed that
> > > 
> > > 1) the global dirty limit is exceeded (dirty=286, limit=256), hence
> > >    the dd tasks are hard blocked in balance_dirty_pages().
> > > 
> > >        flush-8:0-1170  [004]   211.068427: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > > 
> > > 2) the flusher thread is not woken up because we test writeback_in_progress()
> > >    in balance_dirty_pages().
> > > 
> > >                 if (unlikely(!writeback_in_progress(bdi)))
> > >                         bdi_start_background_writeback(bdi);
> > > 
> > > Thus the flusher thread waits and waits, as in the trace below.
> > > 
> > >        flush-8:0-1170  [004]   211.068427: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > >        flush-8:0-1170  [004]   211.068428: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > >        flush-8:0-1170  [004]   211.068428: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > >        flush-8:0-1170  [004]   211.068440: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=9 index=0 to_write=1024 wrote=0
> > >        flush-8:0-1170  [004]   211.068442: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > >        flush-8:0-1170  [004]   211.068443: writeback_wait: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > 
> > >        flush-8:0-1170  [004]   213.110122: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > >        flush-8:0-1170  [004]   213.110126: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > >        flush-8:0-1170  [004]   213.110126: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > >        flush-8:0-1170  [004]   213.110134: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=11 index=0 to_write=1024 wrote=0
> > >        flush-8:0-1170  [004]   213.110135: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > >        flush-8:0-1170  [004]   213.110135: writeback_wait: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > 
> > >        flush-8:0-1170  [004]   217.193470: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > >        flush-8:0-1170  [004]   217.193471: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > >        flush-8:0-1170  [004]   217.193471: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > >        flush-8:0-1170  [004]   217.193483: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=15 index=0 to_write=1024 wrote=0
> > >        flush-8:0-1170  [004]   217.193485: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > 
> > It's still puzzling why the dirty pages remain at 286 and do not get
> > cleaned by either flusher thread (local XFS or NFSROOT) for such a
> > long time..
>   I was looking at this as well. So the reason why pages were not cleaned
> by the flusher thread is that there were 2 dirty inodes and the inode with
> dirty pages had i_dirtied_when newer than the time when we started this
> background writeback. Thus the running background writeback work always
> included only the other inode, which had no dirty pages but I_DIRTY_SYNC set.

Yes that's very likely, given that the background work can run for a very
long time and there are still code paths that redirty_tail() the inode.
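
For reference, the exclusion comes from the cutoff check when b_io is
refilled. Simplified from move_expired_inodes() (sb-sorting and the exact
argument plumbing omitted, so treat this as a sketch):

static int move_expired_inodes(struct list_head *delaying_queue,
			       struct list_head *dispatch_queue,
			       unsigned long *older_than_this)
{
	struct inode *inode;
	int moved = 0;

	/*
	 * Only inodes dirtied before *older_than_this (captured when the
	 * work started) are moved from b_dirty to b_io; an inode whose
	 * dirtied_when was bumped to "now" never enters the running pass.
	 */
	while (!list_empty(delaying_queue)) {
		inode = wb_inode(delaying_queue->prev);
		if (older_than_this &&
		    inode_dirtied_after(inode, *older_than_this))
			break;	/* too new: leave it and everything newer */
		list_move(&inode->i_wb_list, dispatch_queue);
		moved++;
	}
	return moved;
}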

This sounds horrible -- as time goes by, more and more inodes could be
excluded from the background writeback due to inode->dirtied_when
being touched undesirably.
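
And this is roughly the redirty_tail() where dirtied_when gets touched
(quoting from memory, so minor details may be off):

static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
{
	assert_spin_locked(&wb->list_lock);
	if (!list_empty(&wb->b_dirty)) {
		struct inode *tail;

		tail = wb_inode(wb->b_dirty.next);
		/*
		 * Keep b_dirty time ordered: if the inode is older than the
		 * current list head, its dirtied_when is reset to "now",
		 * which is exactly what pushes it past the work's cutoff.
		 */
		if (time_before(inode->dirtied_when, tail->dirtied_when))
			inode->dirtied_when = jiffies;
	}
	list_move(&inode->i_wb_list, &wb->b_dirty);
}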

> Apparently XFS is stubborn and refuses to write the inode although we try
> rather hard. That is probably because the dd writing to this inode is stuck
> in balance_dirty_pages() while holding the ilock - which is a bit unfortunate
> behavior but what can we do...

Yeah, and there may well be other known or unknown long-lived blocking
cases. What we can do in the VFS is to be less picky about these conditions...

To be frank I still like the requeue_io_wait() approach. It can
trivially replace _all_ redirty_tail() calls and hence avoid touching
inode->dirtied_when -- keeping that timestamp sane is the one thing I
would really like to do.

It also narrows the set of possibly blocked inodes, and hence the need
for heuristic wait-and-retry, down to the minimum.

It does introduce one more list, which IMHO is more tolerable than the
problems it fixes. And its maximum delay can be reduced explicitly if
necessary: when there are heavy dirtiers, the flusher will quickly be
woken up by balance_dirty_pages(); in other cases we can let the kupdate
work check b_more_io_wait before aborting, to ensure a 5s max delay:

@@ -863,7 +863,7 @@ static long wb_check_old_data_flush(stru
 
        expired = wb->last_old_flush +
                        msecs_to_jiffies(dirty_writeback_interval * 10);
-       if (time_before(jiffies, expired))
+       if (time_before(jiffies, expired) && list_empty(&wb->b_more_io_wait))
                return 0;
 
        wb->last_old_flush = jiffies;
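
For completeness, the core of requeue_io_wait() itself would be something
like the below (a sketch; the requeue6 patches may differ in detail). The
point is that the inode is parked without touching inode->dirtied_when:

/*
 * b_more_io_wait is a new list on bdi_writeback; queue_io() additionally
 * does list_splice_init(&wb->b_more_io_wait, &wb->b_io) so that these
 * inodes are retried on every refill of b_io.
 */
static void requeue_io_wait(struct inode *inode, struct bdi_writeback *wb)
{
	assert_spin_locked(&wb->list_lock);
	list_move(&inode->i_wb_list, &wb->b_more_io_wait);
}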

>   I think the patch you suggest in the other email does not fix the above
> scenario (although it is useful for reducing latency so I'll include it -
> thanks for it!).  Probably you were just lucky enough not to hit it in your
> next run. What I'd suggest is to refresh oldest_jif in wb_writeback() when
> we do not make any progress with writeback. Thus we allow freshly dirtied
> inodes to be queued when we cannot make progress with the current set of
> inodes. The resulting patch is attached.

Refreshing oldest_jif should be safe for the background work. However
there is a risk of livelock for other works, as the oldest_jif test is
there precisely to prevent livelocks..

For one thing, it will break this code:

                /*
                 * Sync livelock prevention. Each inode is tagged and synced in
                 * one shot. If still dirty, it will be redirty_tail()'ed below.
                 * Update the dirty time to prevent enqueue and sync it again.
                 */
                if ((inode->i_state & I_DIRTY) &&
                    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
                        inode->dirtied_when = jiffies;


> From b85b7cdaf5fafec2d850304c1d0b1813c1c122a3 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@xxxxxxx>
> Date: Thu, 8 Sep 2011 01:05:25 +0200
> Subject: [PATCH 1/2] writeback: Improve busyloop prevention
> 
> Writeback of an inode can be stalled by things like internal fs locks being
> held. So in case we didn't write anything during a pass through the b_io list,
> just wait for a moment and try again. Also allow newly dirtied inodes to be
> queued during the retry. When retrying is fruitless for a long time, or we
> have some other work to do, we just stop the current work to avoid blocking
> the flusher thread.
> 
> CC: Christoph Hellwig <hch@xxxxxxxxxxxxx>
> Reviewed-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
> Signed-off-by: Jan Kara <jack@xxxxxxx>
> ---
>  fs/fs-writeback.c |   61 +++++++++++++++++++++++++++++++++++++++++-----------
>  1 files changed, 48 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 04cf3b9..6e909a9 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -699,8 +699,11 @@ static long wb_writeback(struct bdi_writeback *wb,
>  	unsigned long wb_start = jiffies;
>  	long nr_pages = work->nr_pages;
>  	unsigned long oldest_jif;
> -	struct inode *inode;
>  	long progress;
> +	long pause = 1;
> +	long max_pause = dirty_writeback_interval ?
> +			   msecs_to_jiffies(dirty_writeback_interval * 10) :
> +			   HZ;

There seems to be no strong reason to prefer dirty_writeback_interval
over HZ. So how about using the simpler form "max_pause = HZ"?
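
I.e. simply

-	long max_pause = dirty_writeback_interval ?
-			   msecs_to_jiffies(dirty_writeback_interval * 10) :
-			   HZ;
+	long max_pause = HZ;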

Thanks,
Fengguang