Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()

Wu Fengguang <fengguang.wu@xxxxxxxxx> · Fri, 7 Oct 2011 22:29:28 +0800

On Fri, Oct 07, 2011 at 10:22:01PM +0800, Jan Kara wrote:
> On Fri 07-10-11 21:43:47, Wu Fengguang wrote:
> > >   Great, thanks for review! I'll resend the two patches to Christoph so
> > > that he can try them.
> > 
> > Jan, I'd like to test out your updated patches with my stupid dd
> > workloads. Would you (re)send them publicly?
>   Ah, I resent them publicly on Wednesday
> (http://comments.gmane.org/gmane.linux.kernel/1199713) but git send-email
> apparently does not add addresses from Acked-by tags to the list of
> recipients, so you didn't get them. Sorry for that. The patches are attached
> for your convenience.

OK, thanks. I only checked the linux-fsdevel list before asking.
The results should be ready tomorrow.

Thanks,
Fengguang

> From a042c2a839ad3cf89d8ee158b2bb4b94b573f578 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@xxxxxxx>
> Date: Thu, 8 Sep 2011 01:05:25 +0200
> Subject: [PATCH 1/2] writeback: Improve busyloop prevention
> 
> Writeback of an inode can be stalled by things like internal fs locks being
> held. So in case we didn't write anything during a pass through the b_io
> list, just wait for a moment and try again.
> 
> CC: Christoph Hellwig <hch@xxxxxxxxxxxxx>
> Reviewed-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
> Signed-off-by: Jan Kara <jack@xxxxxxx>
> ---
>  fs/fs-writeback.c |   26 ++++++++++++++------------
>  1 files changed, 14 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 04cf3b9..bdeb26a 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -699,8 +699,8 @@ static long wb_writeback(struct bdi_writeback *wb,
>  	unsigned long wb_start = jiffies;
>  	long nr_pages = work->nr_pages;
>  	unsigned long oldest_jif;
> -	struct inode *inode;
>  	long progress;
> +	long pause = 1;
>  
>  	oldest_jif = jiffies;
>  	work->older_than_this = &oldest_jif;
> @@ -755,25 +755,27 @@ static long wb_writeback(struct bdi_writeback *wb,
>  		 * mean the overall work is done. So we keep looping as long
>  		 * as made some progress on cleaning pages or inodes.
>  		 */
> -		if (progress)
> +		if (progress) {
> +			pause = 1;
>  			continue;
> +		}
>  		/*
>  		 * No more inodes for IO, bail
>  		 */
>  		if (list_empty(&wb->b_more_io))
>  			break;
>  		/*
> -		 * Nothing written. Wait for some inode to
> -		 * become available for writeback. Otherwise
> -		 * we'll just busyloop.
> +		 * Nothing written (some internal fs locks were unavailable or
> +		 * inode was under writeback from balance_dirty_pages() or
> +		 * similar conditions).  Wait for a while to avoid busylooping.
>  		 */
> -		if (!list_empty(&wb->b_more_io))  {
> -			trace_writeback_wait(wb->bdi, work);
> -			inode = wb_inode(wb->b_more_io.prev);
> -			spin_lock(&inode->i_lock);
> -			inode_wait_for_writeback(inode, wb);
> -			spin_unlock(&inode->i_lock);
> -		}
> +		trace_writeback_wait(wb->bdi, work);
> +		spin_unlock(&wb->list_lock);
> +		__set_current_state(TASK_INTERRUPTIBLE);
> +		schedule_timeout(pause);
> +		if (pause < HZ / 10)
> +			pause <<= 1;
> +		spin_lock(&wb->list_lock);
>  	}
>  	spin_unlock(&wb->list_lock);
>  
> -- 
> 1.7.1
> 
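
For readers following the first patch, the backoff it adds to
wb_writeback() can be modeled in user space. The sketch below is only an
illustration, not kernel code: next_pause() and total_sleep() are
hypothetical helpers, and the hz parameter stands in for the kernel's HZ.
It assumes the loop sleeps for the current pause first and only then
doubles it, as in the hunk above.

```c
#include <assert.h>

/*
 * Hypothetical model of the pause update in the patched loop:
 * progress resets the pause to 1 tick, otherwise the pause doubles
 * as long as it is still below hz / 10.
 */
static long next_pause(long pause, int made_progress, long hz)
{
	assert(hz > 0 && pause > 0);
	if (made_progress)
		return 1;
	if (pause < hz / 10)
		pause <<= 1;
	return pause;
}

/*
 * Total ticks slept over a number of consecutive passes that made no
 * progress, starting from the initial pause of 1.
 */
static long total_sleep(int passes, long hz)
{
	long pause = 1;
	long total = 0;
	int i;

	for (i = 0; i < passes; i++) {
		total += pause;			/* models schedule_timeout(pause) */
		pause = next_pause(pause, 0, hz);
	}
	return total;
}
```

In this model, with hz = 1000 the pause settles at 128 ticks, the first
power of two at or above hz / 10, because the comparison is made before
the shift. Ten idle passes then sleep 511 ticks in total.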

> From 0a4a2cb4d5432f5446215b1e6e44f7d83032dba3 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@xxxxxxx>
> Date: Thu, 8 Sep 2011 01:46:42 +0200
> Subject: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
> 
> Calling redirty_tail() can put off inode writeback for up to 30 seconds (or
> whatever dirty_expire_centisecs is). This is an unnecessarily big delay in
> some cases and in other cases it is a really bad thing. In particular, XFS
> tries to be nice to writeback and when ->write_inode is called for an inode
> with a locked ilock, it just redirties the inode and returns EAGAIN. That
> currently causes writeback_single_inode() to redirty_tail() the inode. As a
> contended ilock is a common thing with XFS while extending files, the result
> can be that inode writeout is put off for a really long time.
> 
> Now that we have more robust busyloop prevention in wb_writeback(), we can
> call requeue_io() in cases where a quick retry is required without fear of
> raising CPU consumption too much.
> 
> CC: Christoph Hellwig <hch@xxxxxxxxxxxxx>
> Acked-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
> Signed-off-by: Jan Kara <jack@xxxxxxx>
> ---
>  fs/fs-writeback.c |   61 ++++++++++++++++++++++++----------------------------
>  1 files changed, 28 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index bdeb26a..c786023 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -356,6 +356,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  	long nr_to_write = wbc->nr_to_write;
>  	unsigned dirty;
>  	int ret;
> +	bool inode_written = false;
>  
>  	assert_spin_locked(&wb->list_lock);
>  	assert_spin_locked(&inode->i_lock);
> @@ -420,6 +421,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  	/* Don't write the inode if only I_DIRTY_PAGES was set */
>  	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
>  		int err = write_inode(inode, wbc);
> +		if (!err)
> +			inode_written = true;
>  		if (ret == 0)
>  			ret = err;
>  	}
> @@ -430,42 +433,39 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  	if (!(inode->i_state & I_FREEING)) {
>  		/*
>  		 * Sync livelock prevention. Each inode is tagged and synced in
> -		 * one shot. If still dirty, it will be redirty_tail()'ed below.
> -		 * Update the dirty time to prevent enqueue and sync it again.
> +		 * one shot. If still dirty, update dirty time and put it back
> +		 * to dirty list to prevent enqueue and syncing it again.
>  		 */
>  		if ((inode->i_state & I_DIRTY) &&
> -		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
> +		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)) {
>  			inode->dirtied_when = jiffies;
> -
> -		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> +			redirty_tail(inode, wb);
> +		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>  			/*
> -			 * We didn't write back all the pages.  nfs_writepages()
> -			 * sometimes bales out without doing anything.
> +			 * We didn't write back all the pages. nfs_writepages()
> +			 * sometimes bails out without doing anything or we
> +			 * just ran out of our writeback slice.
>  			 */
>  			inode->i_state |= I_DIRTY_PAGES;
> -			if (wbc->nr_to_write <= 0) {
> -				/*
> -				 * slice used up: queue for next turn
> -				 */
> -				requeue_io(inode, wb);
> -			} else {
> -				/*
> -				 * Writeback blocked by something other than
> -				 * congestion. Delay the inode for some time to
> -				 * avoid spinning on the CPU (100% iowait)
> -				 * retrying writeback of the dirty page/inode
> -				 * that cannot be performed immediately.
> -				 */
> -				redirty_tail(inode, wb);
> -			}
> +			requeue_io(inode, wb);
>  		} else if (inode->i_state & I_DIRTY) {
>  			/*
>  			 * Filesystems can dirty the inode during writeback
>  			 * operations, such as delayed allocation during
>  			 * submission or metadata updates after data IO
> -			 * completion.
> +			 * completion. Also inode could have been dirtied by
> +			 * some process aggressively touching metadata.
> +			 * Finally, filesystem could just fail to write the
> +			 * inode for some reason. We have to distinguish the
> +			 * last case from the previous ones - in the last case
> +			 * we want to give the inode quick retry, in the
> +			 * other cases we want to put it back to the dirty list
> +			 * to avoid livelocking of writeback.
>  			 */
> -			redirty_tail(inode, wb);
> +			if (inode_written)
> +				redirty_tail(inode, wb);
> +			else
> +				requeue_io(inode, wb);
>  		} else {
>  			/*
>  			 * The inode is clean.  At this point we either have
> @@ -583,10 +583,10 @@ static long writeback_sb_inodes(struct super_block *sb,
>  			wrote++;
>  		if (wbc.pages_skipped) {
>  			/*
> -			 * writeback is not making progress due to locked
> -			 * buffers.  Skip this inode for now.
> +			 * Writeback is not making progress due to unavailable
> +			 * fs locks or similar condition. Retry in next round.
>  			 */
> -			redirty_tail(inode, wb);
> +			requeue_io(inode, wb);
>  		}
>  		spin_unlock(&inode->i_lock);
>  		spin_unlock(&wb->list_lock);
> @@ -618,12 +618,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
>  		struct super_block *sb = inode->i_sb;
>  
>  		if (!grab_super_passive(sb)) {
> -			/*
> -			 * grab_super_passive() may fail consistently due to
> -			 * s_umount being grabbed by someone else. Don't use
> -			 * requeue_io() to avoid busy retrying the inode/sb.
> -			 */
> -			redirty_tail(inode, wb);
> +			requeue_io(inode, wb);
>  			continue;
>  		}
>  		wrote += writeback_sb_inodes(sb, wb, work);
> -- 
> 1.7.1
> 
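
The requeue decision that the second patch leaves in
writeback_single_inode() can be summarized as a small decision function.
This is a user-space sketch under stated assumptions, not the kernel
code: struct inode_model, enum requeue_action and pick_action() are
hypothetical names, and each boolean stands in for the corresponding
test in the hunk above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes; the names are illustrative only. */
enum requeue_action {
	ACT_NONE,		/* I_FREEING set: the block is skipped */
	ACT_REDIRTY_TAIL,	/* back to the dirty list, long delay */
	ACT_REQUEUE_IO,		/* queued for a quick retry */
	ACT_CLEAN		/* inode is clean: unchanged else branch */
};

/* Hypothetical stand-ins for the conditions tested by the patch. */
struct inode_model {
	bool freeing;		/* inode->i_state & I_FREEING */
	bool dirty;		/* inode->i_state & I_DIRTY */
	bool pages_dirty;	/* mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) */
	bool sync_or_tagged;	/* WB_SYNC_ALL || wbc->tagged_writepages */
	bool inode_written;	/* write_inode() was called and returned 0 */
};

static enum requeue_action pick_action(struct inode_model m)
{
	if (m.freeing)
		return ACT_NONE;
	/* Sync livelock prevention takes priority over everything else. */
	if (m.dirty && m.sync_or_tagged)
		return ACT_REDIRTY_TAIL;
	/* Pages left over: always retry soon, slice used up or not. */
	if (m.pages_dirty)
		return ACT_REQUEUE_IO;
	/* Redirtied after a successful write: delay; failed write: retry. */
	if (m.dirty)
		return m.inode_written ? ACT_REDIRTY_TAIL : ACT_REQUEUE_IO;
	return ACT_CLEAN;
}
```

The XFS case from the changelog corresponds to dirty = true with
inode_written = false (->write_inode returned EAGAIN), which this model
maps to ACT_REQUEUE_IO instead of the old redirty_tail() delay.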
