On Sat, Oct 08, 2011 at 07:52:27PM +0800, Wu Fengguang wrote:
> On Sat, Oct 08, 2011 at 12:00:36PM +0800, Wu Fengguang wrote:
> > Hi Jan,
> >
> > The test results don't look good: btrfs is heavily impacted and the
> > other filesystems are slightly impacted.
> >
> > I'll send you the detailed logs in private emails (too large for the
> > mailing list). Basically, I noticed many writeback_wait traces that
> > never appear w/o this patch. In the btrfs cases that see larger
> > regressions, I see large fluctuations in the writeout bandwidth and
> > long disk idle periods. It's still a bit puzzling how all of this
> > happens.
>
> Sorry, I find that part of the regressions (about 2-3%) was caused by
> a recent change to my test scripts. Here are the fairer comparisons,
> which show regressions only in btrfs and xfs:
>
>      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
>     ------------------------  ------------------------
>        37.34      +0.8%       37.65  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
>        44.44      +3.4%       45.96  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
>        41.70      +1.0%       42.14  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
>        46.45      -0.3%       46.32  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
>        56.60      -0.3%       56.41  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
>        54.14      +0.9%       54.63  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
>        30.66      -0.7%       30.44  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
>        35.24      +1.6%       35.82  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
>        43.58      +0.5%       43.80  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
>        50.42      -0.6%       50.14  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
>        56.23      -1.0%       55.64  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
>        58.12      -0.5%       57.84  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
>        45.37      +1.4%       46.03  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
>        43.71      +2.2%       44.69  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
>        35.58      +0.5%       35.77  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
>        56.39      +1.4%       57.16  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
>        51.26      +1.5%       52.04  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
>       787.25      +0.7%      792.47  TOTAL
>
>      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
>     ------------------------  ------------------------
>        44.53     -18.6%       36.23  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
>        55.89      -0.4%       55.64  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
>        51.11      +0.5%       51.35  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
>        41.76      -4.8%       39.77  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
>        48.34      -0.3%       48.18  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
>        52.36      -0.2%       52.26  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
>        31.07      -1.1%       30.74  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
>        55.44      -0.6%       55.09  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
>        47.59     -31.2%       32.74  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
>       428.07      -6.1%      401.99  TOTAL
>
>      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
>     ------------------------  ------------------------
>        58.23     -82.6%       10.13  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
>        58.43     -80.3%       11.54  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
>        58.53     -79.9%       11.76  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
>        56.55     -31.7%       38.63  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
>        56.11     -30.1%       39.25  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
>        56.21     -18.3%       45.93  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
>       344.06     -54.3%      157.24  TOTAL
>
> I'm now bisecting the patches to find out the root cause.
My current finding is that the btrfs regressions go away when only the
first patch is applied, or when the second patch is reduced to the one
below:

     3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue2+
    ------------------------  ------------------------
       58.23      -0.3%       58.06  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
       58.43      -0.4%       58.19  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
       58.53      -0.5%       58.25  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
       56.55      -0.4%       56.30  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
       56.11      +0.1%       56.19  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
       56.21      -0.2%       56.12  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
       50.42      -2.1%       49.36  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
       56.23      -2.2%       55.00  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
       58.12      -2.2%       56.82  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
       41.76      +1.6%       42.42  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
       48.34      -1.0%       47.85  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
       52.36      -1.5%       51.57  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
      651.29      -0.8%      646.12  TOTAL

     3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue3+
    ------------------------  ------------------------
       56.55      -3.3%       54.70  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
       56.11      -0.4%       55.91  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
       56.21      +0.7%       56.58  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
      168.87      -1.0%      167.20  TOTAL

--- linux-next.orig/fs/fs-writeback.c   2011-10-08 20:49:31.000000000 +0800
+++ linux-next/fs/fs-writeback.c        2011-10-08 20:51:22.000000000 +0800
@@ -370,6 +370,7 @@ writeback_single_inode(struct inode *ino
         long nr_to_write = wbc->nr_to_write;
         unsigned dirty;
         int ret;
+        bool inode_written = false;
 
         assert_spin_locked(&wb->list_lock);
         assert_spin_locked(&inode->i_lock);
@@ -434,6 +435,8 @@ writeback_single_inode(struct inode *ino
         /* Don't write the inode if only I_DIRTY_PAGES was set */
         if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
                 int err = write_inode(inode, wbc);
+                if (!err)
+                        inode_written = true;
                 if (ret == 0)
                         ret = err;
         }
@@ -477,9 +480,19 @@ writeback_single_inode(struct inode *ino
                          * Filesystems can dirty the inode during writeback
                          * operations, such as delayed allocation during
                          * submission or metadata updates after data IO
-                         * completion.
+                         * completion. Also inode could have been dirtied by
+                         * some process aggressively touching metadata.
+                         * Finally, filesystem could just fail to write the
+                         * inode for some reason. We have to distinguish the
+                         * last case from the previous ones - in the last case
+                         * we want to give the inode quick retry, in the
+                         * other cases we want to put it back to the dirty list
+                         * to avoid livelocking of writeback.
                          */
-                        redirty_tail(inode, wb);
+                        if (inode_written)
+                                redirty_tail(inode, wb);
+                        else
+                                requeue_io(inode, wb);
                 } else {
                         /*
                          * The inode is clean. At this point we either have
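
For background on why this hunk matters: redirty_tail() and requeue_io()
put the inode on different writeback lists, so they differ in how soon the
inode is looked at again. The sketch below is a simplified paraphrase of
the 3.1-era helpers in fs/fs-writeback.c (written from memory, not quoted
verbatim), just to illustrate that difference:

/*
 * Simplified sketch, not the exact upstream code.
 *
 * redirty_tail(): move the inode back to the tail of wb->b_dirty and
 * possibly refresh dirtied_when, so it is only revisited after the rest
 * of the dirty queue has been cycled through.
 */
static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
{
        assert_spin_locked(&wb->list_lock);
        if (!list_empty(&wb->b_dirty)) {
                struct inode *tail = wb_inode(wb->b_dirty.next);

                /* keep b_dirty ordered by dirtied_when */
                if (time_before(inode->dirtied_when, tail->dirtied_when))
                        inode->dirtied_when = jiffies;
        }
        list_move(&inode->i_wb_list, &wb->b_dirty);
}

/*
 * requeue_io(): move the inode to wb->b_more_io, which gets spliced back
 * for another pass in the same writeback run, i.e. a quick retry.
 */
static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
{
        assert_spin_locked(&wb->list_lock);
        list_move(&inode->i_wb_list, &wb->b_more_io);
}

This doesn't by itself explain the btrfs numbers, but it shows the
behavioral difference the patch introduces: an inode whose write_inode()
fails is now retried quickly via b_more_io instead of being parked back
on b_dirty.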