Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()

On Tue 11-10-11 10:36:38, Wu Fengguang wrote:
> On Tue, Oct 11, 2011 at 07:30:07AM +0800, Jan Kara wrote:
> > On Mon 10-10-11 19:31:30, Wu Fengguang wrote:
> > > On Mon, Oct 10, 2011 at 07:21:33PM +0800, Jan Kara wrote:
> > > >   Hi Fengguang,
> > > > 
> > > > On Sat 08-10-11 12:00:36, Wu Fengguang wrote:
> > > > > The test results don't look good: btrfs is heavily impacted and the
> > > > > other filesystems are slightly impacted.
> > > > >
> > > > > I'll send you the detailed logs in private emails (too large for the
> > > > > mailing list).  Basically I noticed many writeback_wait traces that never
> > > > > appear w/o this patch.
> > > >   OK, thanks for running these tests. I'll have a look at detailed logs.
> > > > I guess the difference can be caused by changes in the redirty/requeue
> > > > logic in the second patch (the changes in the first patch could possibly
> > > > split one writeback_wait event into several, but could never introduce
> > > > new events).
> > > > 
> > > > I guess I'll also try to reproduce the problem since it should be pretty
> > > > easy when you see such a huge regression even with 1 dd process on btrfs
> > > > filesystem.
> > > > 
> > > > > In the btrfs cases that see larger regressions, I see large fluctuations
> > > > > in the writeout bandwidth and long disk idle periods. It's still a bit
> > > > > puzzling how all of this happens..
> > > >   Yes, I don't understand it yet either...
> > > 
> > > Jan, it's obviously caused by this chunk, which is not really
> > > necessary for fixing Christoph's problem. So the easy way is to go
> > > ahead without this chunk.
> >   Yes, thanks a lot for debugging this! I'd still like to understand why
> > the hunk below is causing such a big problem for btrfs. I was looking into
> > the traces and all I could find so far was that for some reason the
> > relevant inode (ino 257) was not getting queued for writeback for a long
> > time (e.g. 20 seconds), which introduced disk idle times and thus bad
> > throughput. But I don't understand yet why the inode was not queued for
> > such a long time... It's too late today but I'll continue my investigation
> > tomorrow.
> 
> Yeah, I have exactly the same observation and puzzle..
  OK, I dug more into this and I think I found an explanation. The problem
starts at
   flush-btrfs-1-1336  [005]    20.688011: writeback_start: bdi btrfs-1:
sb_dev 0:0 nr_pages=23685 sync_mode=0 kupdate=1 range_cyclic=1 background=0
reason=periodic
  in the btrfs trace you sent me, when we start "kupdate" style writeback
for bdi "btrfs-1". This work then blocks the flusher thread up to the moment:
   flush-btrfs-1-1336  [007]    45.707479: writeback_start: bdi btrfs-1:
sb_dev 0:0 nr_pages=18173 sync_mode=0 kupdate=1 range_cyclic=1 background=0
reason=periodic
   flush-btrfs-1-1336  [007]    45.707479: writeback_written: bdi btrfs-1:
sb_dev 0:0 nr_pages=18173 sync_mode=0 kupdate=1 range_cyclic=1 background=0
reason=periodic

  (i.e. for 25 seconds). The reason why this work blocks the flusher thread
for so long is that btrfs has a "btree inode" - essentially an inode holding
filesystem metadata - and btrfs ignores any ->writepages() request for this
inode coming from kupdate style writeback. So we always try to write this
inode, make no progress, requeue the inode (as its mapping is still tagged
dirty), see that b_more_io is nonempty, sleep for a while and then retry.
We do not include inode 257 with the real dirty data in the writeback
because this is kupdate style writeback and inode 257 does not have a dirty
timestamp old enough. This loop would break either after 30s, when the
inode with data becomes old enough, or - as we see above - at the moment
when btrfs decides to do a transaction commit and cleans the metadata inode
by its own means. In either case this is far too late...
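
  To illustrate the first half of that loop, this is roughly what the btree
inode's ->writepages() does with a kupdate request (paraphrased from memory
and simplified, not the exact upstream code):

static int btree_writepages(struct address_space *mapping,
			    struct writeback_control *wbc)
{
	if (wbc->sync_mode == WB_SYNC_NONE) {
		/*
		 * Kupdate style writeback is ignored completely for the
		 * btree inode, so the flusher makes no progress on it and
		 * keeps requeueing it as long as its mapping stays dirty.
		 */
		if (wbc->for_kupdate)
			return 0;
		/* ... it also bails out when there is little dirty metadata ... */
	}
	/* otherwise write back the dirty btree pages as usual */
	return extent_writepages(&BTRFS_I(mapping->host)->io_tree,
				 mapping, btree_get_extent, wbc);
}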
 
  So for now I don't see a better alternative than to revert to the old
behavior in writeback_single_inode() as you suggested earlier. That way we
would redirty_tail() inodes which we cannot clean, so they won't cause
livelocking of the kupdate work. Longer term we might want to be cleverer
about switching away from kupdate style writeback to pure background
writeback, but it's not yet clear to me what the logic should be so that we
still give enough preference to old inodes...
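
  For reference, the difference between the two requeueing helpers in
fs/fs-writeback.c boils down to which list the inode ends up on, and thus
whether it gets a quick retry or waits for its dirty timestamp to expire
again (simplified from my reading of the code):

/*
 * Put the inode back on b_dirty as the most recently dirtied inode. For
 * kupdate style writeback it will only be looked at again once its dirty
 * timestamp is old enough, i.e. possibly dirty_expire_centisecs later.
 */
static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
{
	assert_spin_locked(&wb->list_lock);
	if (!list_empty(&wb->b_dirty)) {
		struct inode *tail = wb_inode(wb->b_dirty.next);

		/* keep b_dirty ordered by dirtied_when */
		if (time_before(inode->dirtied_when, tail->dirtied_when))
			inode->dirtied_when = jiffies;
	}
	list_move(&inode->i_wb_list, &wb->b_dirty);
}

/* Put the inode on b_more_io so it is retried in this writeback pass. */
static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
{
	assert_spin_locked(&wb->list_lock);
	list_move(&inode->i_wb_list, &wb->b_more_io);
}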

  New version of the second patch is attached.

								Honza
From 97fb1a2f4d334787e9dd23ef58ead1cf4e439cc2 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@xxxxxxx>
Date: Thu, 8 Sep 2011 01:46:42 +0200
Subject: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()

Calling redirty_tail() can put off inode writeback for up to 30 seconds (or
whatever dirty_expire_centisecs is). This is an unnecessarily big delay in
some cases, and in other cases it is a really bad thing. In particular, XFS
tries to be nice to writeback and when ->write_inode is called for an inode
with a locked ilock, it just redirties the inode and returns EAGAIN. That
currently causes writeback_single_inode() to redirty_tail() the inode. As a
contended ilock is common with XFS while extending files, the result can be
that inode writeout is put off for a really long time.

Now that we have more robust busyloop prevention in wb_writeback(), we can
call requeue_io() in cases where a quick retry is required, without fear of
raising CPU consumption too much.

CC: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Acked-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
Signed-off-by: Jan Kara <jack@xxxxxxx>
---
 fs/fs-writeback.c |   55 ++++++++++++++++++++++++++++++----------------------
 1 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index bdeb26a..b03878a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -356,6 +356,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
+	bool inode_written = false;
 
 	assert_spin_locked(&wb->list_lock);
 	assert_spin_locked(&inode->i_lock);
@@ -420,6 +421,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		int err = write_inode(inode, wbc);
+		if (!err)
+			inode_written = true;
 		if (ret == 0)
 			ret = err;
 	}
@@ -430,17 +433,20 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	if (!(inode->i_state & I_FREEING)) {
 		/*
 		 * Sync livelock prevention. Each inode is tagged and synced in
-		 * one shot. If still dirty, it will be redirty_tail()'ed below.
-		 * Update the dirty time to prevent enqueue and sync it again.
+		 * one shot. If still dirty, update dirty time and put it back
+		 * to dirty list to prevent enqueue and syncing it again.
 		 */
 		if ((inode->i_state & I_DIRTY) &&
-		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
+		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)) {
 			inode->dirtied_when = jiffies;
-
-		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+			redirty_tail(inode, wb);
+		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
-			 * We didn't write back all the pages.  nfs_writepages()
-			 * sometimes bales out without doing anything.
+			 * We didn't write back all the pages. We may have just
+			 * run out of our writeback slice, or nfs_writepages()
+			 * sometimes bales out without doing anything, or e.g.
+			 * btrfs ignores for_kupdate writeback requests for
+			 * metadata inodes.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
 			if (wbc->nr_to_write <= 0) {
@@ -450,11 +456,9 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 				requeue_io(inode, wb);
 			} else {
 				/*
-				 * Writeback blocked by something other than
-				 * congestion. Delay the inode for some time to
-				 * avoid spinning on the CPU (100% iowait)
-				 * retrying writeback of the dirty page/inode
-				 * that cannot be performed immediately.
+				 * Writeback blocked by something. Put inode
+				 * back to dirty list to prevent livelocking of
+				 * writeback.
 				 */
 				redirty_tail(inode, wb);
 			}
@@ -463,9 +467,19 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 			 * Filesystems can dirty the inode during writeback
 			 * operations, such as delayed allocation during
 			 * submission or metadata updates after data IO
-			 * completion.
+			 * completion. Also inode could have been dirtied by
+			 * some process aggressively touching metadata.
+			 * Finally, filesystem could just fail to write the
+			 * inode for some reason. We have to distinguish the
+			 * last case from the previous ones - in the last case
+			 * we want to give the inode quick retry, in the
+			 * other cases we want to put it back to the dirty list
+			 * to avoid livelocking of writeback.
 			 */
-			redirty_tail(inode, wb);
+			if (inode_written)
+				redirty_tail(inode, wb);
+			else
+				requeue_io(inode, wb);
 		} else {
 			/*
 			 * The inode is clean.  At this point we either have
@@ -583,10 +597,10 @@ static long writeback_sb_inodes(struct super_block *sb,
 			wrote++;
 		if (wbc.pages_skipped) {
 			/*
-			 * writeback is not making progress due to locked
-			 * buffers.  Skip this inode for now.
+			 * Writeback is not making progress due to unavailable
+			 * fs locks or similar condition. Retry in next round.
 			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 		}
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->list_lock);
@@ -618,12 +632,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 		struct super_block *sb = inode->i_sb;
 
 		if (!grab_super_passive(sb)) {
-			/*
-			 * grab_super_passive() may fail consistently due to
-			 * s_umount being grabbed by someone else. Don't use
-			 * requeue_io() to avoid busy retrying the inode/sb.
-			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 			continue;
 		}
 		wrote += writeback_sb_inodes(sb, wb, work);
-- 
1.7.1

