On Sat, Oct 08, 2011 at 09:49:27PM +0800, Wu Fengguang wrote:
> On Sat, Oct 08, 2011 at 07:52:27PM +0800, Wu Fengguang wrote:
> > On Sat, Oct 08, 2011 at 12:00:36PM +0800, Wu Fengguang wrote:
> > > Hi Jan,
> > >
> > > The test results do not look good: btrfs is heavily impacted and the
> > > other filesystems are slightly impacted.
> > >
> > > I'll send you the detailed logs in private emails (too large for the
> > > mailing list). Basically I noticed many writeback_wait traces that
> > > never appear w/o this patch. In the btrfs cases that see larger
> > > regressions, I see large fluctuations in the writeout bandwidth and
> > > long disk idle periods. It's still a bit puzzling how all these
> > > happen..
> >
> > Sorry, I find that part of the regressions (about 2-3%) was caused by
> > a recent change to my test scripts. Here are the fairer comparisons,
> > which show regressions only in btrfs and xfs:
> >
> >       3.1.0-rc8-ioless6a+    3.1.0-rc8-ioless6-requeue+
> >     ------------------------  ------------------------
> >        37.34        +0.8%        37.65  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
> >        44.44        +3.4%        45.96  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
> >        41.70        +1.0%        42.14  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
> >        46.45        -0.3%        46.32  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
> >        56.60        -0.3%        56.41  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
> >        54.14        +0.9%        54.63  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
> >        30.66        -0.7%        30.44  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
> >        35.24        +1.6%        35.82  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
> >        43.58        +0.5%        43.80  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
> >        50.42        -0.6%        50.14  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
> >        56.23        -1.0%        55.64  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
> >        58.12        -0.5%        57.84  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
> >        45.37        +1.4%        46.03  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
> >        43.71        +2.2%        44.69  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
> >        35.58        +0.5%        35.77  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
> >        56.39        +1.4%        57.16  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
> >        51.26        +1.5%        52.04  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
> >       787.25        +0.7%       792.47  TOTAL
> >
> >       3.1.0-rc8-ioless6a+    3.1.0-rc8-ioless6-requeue+
> >     ------------------------  ------------------------
> >        44.53       -18.6%        36.23  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
> >        55.89        -0.4%        55.64  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
> >        51.11        +0.5%        51.35  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
> >        41.76        -4.8%        39.77  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
> >        48.34        -0.3%        48.18  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
> >        52.36        -0.2%        52.26  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
> >        31.07        -1.1%        30.74  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
> >        55.44        -0.6%        55.09  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
> >        47.59       -31.2%        32.74  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
> >       428.07        -6.1%       401.99  TOTAL
> >
> >       3.1.0-rc8-ioless6a+    3.1.0-rc8-ioless6-requeue+
> >     ------------------------  ------------------------
> >        58.23       -82.6%        10.13  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
> >        58.43       -80.3%        11.54  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
> >        58.53       -79.9%        11.76  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
> >        56.55       -31.7%        38.63  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
> >        56.11       -30.1%        39.25  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
> >        56.21       -18.3%        45.93  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
> >       344.06       -54.3%       157.24  TOTAL
> >
> > I'm now bisecting the patches to find out the root cause.
>
> Current findings are: when only the first patch is applied, or the second
> patch is reduced to the one below, the btrfs regressions go away.

And the below reduced patch is also OK:

      3.1.0-rc8-ioless6a+    3.1.0-rc8-ioless6-requeue4+
    ------------------------  ------------------------
       58.23        -0.4%        57.98  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
       58.43        -2.2%        57.13  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
       58.53        -1.2%        57.83  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
       37.34        -0.7%        37.07  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
       44.44        +0.2%        44.52  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
       41.70        +0.0%        41.72  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
       46.45        -0.7%        46.10  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
       56.60        -0.8%        56.15  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
       54.14        +0.3%        54.33  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
       44.53        -7.3%        41.29  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
       55.89        +0.9%        56.39  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
       51.11        +1.0%        51.60  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
       56.55        -1.0%        55.97  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
       56.11        -1.5%        55.28  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
       56.21        -1.9%        55.16  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
       30.66        -2.7%        29.82  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
       35.24        -0.7%        35.00  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
       43.58        -2.1%        42.65  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
       50.42        -2.4%        49.21  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
       56.23        -2.2%        55.00  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
       58.12        -1.8%        57.08  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
       41.76        -5.1%        39.61  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
       48.34        -2.6%        47.06  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
       52.36        -3.3%        50.64  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
       45.37        +0.7%        45.70  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
       43.71        +0.7%        44.00  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
       35.58        +0.7%        35.82  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
       56.39        -1.1%        55.77  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
       51.26        -0.6%        50.94  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
       31.07       -13.3%        26.94  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
       55.44        +0.5%        55.72  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
       47.59        +1.6%        48.33  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
     1559.39        -1.4%      1537.83  TOTAL

Subject: writeback: Replace some redirty_tail() calls with requeue_io()
Date: Thu, 8 Sep 2011 01:46:42 +0200
From: Jan Kara <jack@xxxxxxx>

Calling redirty_tail() can put off inode writeback for up to 30 seconds
(or whatever dirty_expire_centisecs is set to). This is an unnecessarily
long delay in some cases, and in other cases it is a really bad thing. In
particular, XFS tries to be nice to writeback: when ->write_inode is
called for an inode with a locked ilock, it just redirties the inode and
returns EAGAIN. That currently causes writeback_single_inode() to
redirty_tail() the inode. As a contended ilock is common with XFS while
extending files, the result can be that inode writeout is put off for a
really long time.

Now that we have more robust busyloop prevention in wb_writeback(), we
can call requeue_io() in cases where a quick retry is required, without
fear of raising CPU consumption too much.
CC: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Acked-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
Signed-off-by: Jan Kara <jack@xxxxxxx>
Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
---
 fs/fs-writeback.c |   30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-10-08 20:49:31.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-10-08 21:51:00.000000000 +0800
@@ -370,6 +370,7 @@ writeback_single_inode(struct inode *ino
 	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
+	bool inode_written = false;
 
 	assert_spin_locked(&wb->list_lock);
 	assert_spin_locked(&inode->i_lock);
@@ -434,6 +435,8 @@ writeback_single_inode(struct inode *ino
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		int err = write_inode(inode, wbc);
+		if (!err)
+			inode_written = true;
 		if (ret == 0)
 			ret = err;
 	}
@@ -477,9 +480,19 @@ writeback_single_inode(struct inode *ino
 			 * Filesystems can dirty the inode during writeback
 			 * operations, such as delayed allocation during
 			 * submission or metadata updates after data IO
-			 * completion.
+			 * completion. Also inode could have been dirtied by
+			 * some process aggressively touching metadata.
+			 * Finally, filesystem could just fail to write the
+			 * inode for some reason. We have to distinguish the
+			 * last case from the previous ones - in the last case
+			 * we want to give the inode quick retry, in the
+			 * other cases we want to put it back to the dirty list
+			 * to avoid livelocking of writeback.
 			 */
-			redirty_tail(inode, wb);
+			if (inode_written)
+				redirty_tail(inode, wb);
+			else
+				requeue_io(inode, wb);
 		} else {
 			/*
 			 * The inode is clean. At this point we either have
@@ -597,10 +610,10 @@ static long writeback_sb_inodes(struct s
 			wrote++;
 			if (wbc.pages_skipped) {
 				/*
-				 * writeback is not making progress due to locked
-				 * buffers.  Skip this inode for now.
+				 * Writeback is not making progress due to unavailable
+				 * fs locks or similar condition. Retry in next round.
 				 */
-				redirty_tail(inode, wb);
+				requeue_io(inode, wb);
 			}
 			spin_unlock(&inode->i_lock);
 			spin_unlock(&wb->list_lock);
@@ -632,12 +645,7 @@ static long __writeback_inodes_wb(struct
 		struct super_block *sb = inode->i_sb;
 
 		if (!grab_super_passive(sb)) {
-			/*
-			 * grab_super_passive() may fail consistently due to
-			 * s_umount being grabbed by someone else. Don't use
-			 * requeue_io() to avoid busy retrying the inode/sb.
-			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 			continue;
 		}
 		wrote += writeback_sb_inodes(sb, wb, work);
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html