On Wed, Nov 09, 2011 at 07:52:07AM +0800, Jan Kara wrote:
> On Fri 04-11-11 23:20:55, Wu Fengguang wrote:
> > On Thu, Nov 03, 2011 at 09:51:36AM +0800, Jan Kara wrote:
> > > On Thu 03-11-11 02:56:03, Wu Fengguang wrote:
> > > > On Fri, Oct 28, 2011 at 04:31:04AM +0800, Jan Kara wrote:
> > > > > On Thu 27-10-11 14:31:33, Wu Fengguang wrote:
> > > > > > On Fri, Oct 21, 2011 at 06:26:16AM +0800, Jan Kara wrote:
> > > > > > > On Thu 20-10-11 21:39:38, Wu Fengguang wrote:
> > > > > > > > On Thu, Oct 20, 2011 at 08:33:00PM +0800, Wu Fengguang wrote:
> > > > > > > > > On Thu, Oct 20, 2011 at 08:09:09PM +0800, Wu Fengguang wrote:
> > > > > > > > > > Jan,
> > > > > > > > > >
> > > > > > > > > > I tried the below combined patch over the ioless one, and found some
> > > > > > > > > > minor regressions. I studied the thresh=1G/ext3-1dd case in particular
> > > > > > > > > > and found that nr_writeback and the iostat avgrq-sz drop from time to time.
> > > > > > > > > >
> > > > > > > > > > I'll try to bisect the changeset.
> > > > > > > > This is interesting: the culprit turns out to be patch 1, which is
> > > > > > > > simply
> > > > > > > >         if (work->for_kupdate) {
> > > > > > > >                 oldest_jif = jiffies -
> > > > > > > >                         msecs_to_jiffies(dirty_expire_interval * 10);
> > > > > > > > -               work->older_than_this = &oldest_jif;
> > > > > > > > -       }
> > > > > > > > +       } else if (work->for_background)
> > > > > > > > +               oldest_jif = jiffies;
> > > > > > > Yeah. I had a look into the trace and you can notice that during the
> > > > > > > whole dd run, we were running a single background writeback work (you can
> > > > > > > verify that by work->nr_pages decreasing steadily). Without refreshing
> > > > > > > oldest_jif, we'd write the block device inode for /dev/sda (you can identify
> > > > > > > it by bdi=8:0, ino=0) only once. When refreshing oldest_jif, we write it
> > > > > > > every 5 seconds (kjournald dirties the device inode after committing a
> > > > > > > transaction by dirtying the metadata buffers which were just committed and
> > > > > > > can now be checkpointed either by kjournald or by the flusher thread). So
> > > > > > > although the performance is slightly reduced, I'd say the behavior is the
> > > > > > > desired one.
> > > > > > >
> > > > > > > Also, if you observed the performance on a really long run, the difference
> > > > > > > should get smaller, because eventually kjournald has to flush the metadata
> > > > > > > blocks when the journal fills up and we need to free some journal space, and
> > > > > > > at that point flushing is even more expensive because we have to do a
> > > > > > > blocking write during which all transaction operations, and thus effectively
> > > > > > > the whole filesystem, are blocked.
> > > > > >
> > > > > > Jan, I got figures for test case
> > > > > >
> > > > > > ext3-1dd-4k-8p-2941M-1000M:10-3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+
> > > > > >
> > > > > > There is not a single drop of nr_writeback in the longer 1200s run, which
> > > > > > wrote ~60GB of data.
> > > > > I did some calculations. The default journal size for a filesystem of your
> > > > > size is 128 MB, which allows recording around 128 GB of data. So your
> > > > > test probably didn't hit the point where the journal is recycled yet.
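For reference, a minimal standalone sketch (not the kernel code itself) of the
cutoff behaviour changed by the patch quoted above: a long-running background
work item that keeps the oldest_jif computed when it started will skip an inode
that kjournald re-dirties while the work runs, whereas refreshing the cutoff to
the current jiffies picks it up again. The inode names and timestamps below are
invented for illustration.

/* Standalone sketch, not kernel code: compare a frozen vs. a refreshed
 * expiry cutoff for one background writeback work item.  All names and
 * timestamps are made up. */
#include <stdio.h>

struct fake_inode {
        const char *name;
        unsigned long dirtied_when;     /* pretend "jiffies", HZ=100 */
};

/* An inode is eligible for this pass if it was dirtied at or before the cutoff. */
static const char *eligible(const struct fake_inode *i, unsigned long cutoff)
{
        return i->dirtied_when <= cutoff ? "queued" : "skipped";
}

int main(void)
{
        unsigned long work_start = 100000;         /* background work begins  */
        unsigned long now = work_start + 1000;     /* 10 s into the same work */
        /* kjournald re-dirties the block device inode 5 s after the work started */
        struct fake_inode bdev = { "bdev 8:0 (ino 0)", work_start + 500 };
        struct fake_inode data = { "1dd data inode",   work_start - 200 };
        const struct fake_inode *inodes[] = { &bdev, &data };
        unsigned i;

        for (i = 0; i < 2; i++)
                printf("%-18s frozen cutoff: %-8s refreshed cutoff: %s\n",
                       inodes[i]->name,
                       eligible(inodes[i], work_start), /* cutoff fixed at work start */
                       eligible(inodes[i], now));       /* cutoff refreshed to "now"  */
        return 0;
}

With the frozen cutoff the bdev inode prints "skipped"; with the refreshed one
it prints "queued", which corresponds to the "write it every 5 seconds"
behaviour described above.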
> > > > > An easy way to make sure the journal gets recycled is to set its size to a
> > > > > lower value when creating the filesystem by
> > > > >   mke2fs -J size=8
> > > > >
> > > > > Then at the latest after writing 8 GB the effect of journal recycling should
> > > > > be visible (I suggest writing at least 16 GB or so, so that we can see some
> > > > > pattern). Also note that without the patch altering background writeback,
> > > > > kjournald will do all the writeback of the metadata, and kjournald works with
> > > > > buffer heads. Thus the IO it does is *not* accounted for in mm statistics. You
> > > > > will observe its effects only as a sudden increase in await or svctm, because
> > > > > the disk got busy with IO you don't see. Also, secondarily, you could probably
> > > > > observe that as a hiccup in the number of dirtied/written pages.
> > > >
> > > > Jan, finally the `correct' results for "-J size=8" w/o the patch
> > > > altering background writeback.
> > > >
> > > > I noticed the periodic small drops of nr_writeback in
> > > > global_dirty_state.png; other than that it looks pretty good.
> > > If you look at the iostat graphs, you'll notice periodic increases in await
> > > time at roughly 100 s intervals. I believe this could be the checkpointing
> > > that's going on in the background. Also there are (negative) peaks in the
> > > "paused" graph. Anyway, the main question is - do you see any throughput
> > > difference with/without the background writeback patch with the small
> > > journal?
> >
> > Jan, I got the results before/after the patch -- there are small
> > performance drops with either plain mkfs or mkfs "-J size=8",
> > while the latter does see smaller drops.
> >
> > To make it more accurate, I use the average wkB/s value reported by
> > iostat for the comparison.
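As an aside, compare.rb is a Ruby helper whose source is not included here, so
the snippet below is only an illustration of the metric shown in the tables
that follow: the per-test average of iostat's wkB/s column and the percentage
change between the two kernels. The sample values are invented.

/* Illustration only: average a run's iostat wkB/s samples and print the
 * percentage change between two kernels, as in the tables below.
 * The sample values are invented. */
#include <stdio.h>
#include <stddef.h>

static double mean(const double *v, size_t n)
{
        double sum = 0;
        size_t i;

        for (i = 0; i < n; i++)
                sum += v[i];
        return n ? sum / n : 0;
}

int main(void)
{
        /* wkB/s samples, one per iostat reporting interval */
        double before[] = { 48200.0, 47950.5, 48110.2, 47880.8 };
        double after[]  = { 41300.7, 41020.3, 41210.9, 41150.1 };
        double b = mean(before, sizeof(before) / sizeof(before[0]));
        double a = mean(after,  sizeof(after)  / sizeof(after[0]));

        /* same shape as a row of the compare.rb output: before  delta%  after */
        printf("%12.2f  %+6.1f%%  %12.2f  avg io_wkB_s\n", b, (a - b) / b * 100.0, a);
        return 0;
}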
> >
> > wfg@bee /export/writeback% ./compare.rb -g jsize -e io_wkB_s thresh*/*-ioless-full-next-20111102+ thresh*/*-20111102+
> >  3.1.0-ioless-full-next-20111102+  3.1.0-ioless-full-bg-all-next-20111102+
> >  ------------------------  ------------------------
> >     35659.34   -0.8%    35377.54  thresh=1000M/ext3:jsize=8-100dd-4k-8p-4096M-1000M:10-X
> >     38564.52   -1.9%    37839.55  thresh=1000M/ext3:jsize=8-10dd-4k-8p-4096M-1000M:10-X
> >     46213.55   -3.1%    44784.05  thresh=1000M/ext3:jsize=8-1dd-4k-8p-4096M-1000M:10-X
> >     47546.62   +0.5%    47790.81  thresh=1000M/ext4:jsize=8-100dd-4k-8p-4096M-1000M:10-X
> >     53166.76   +0.6%    53512.28  thresh=1000M/ext4:jsize=8-10dd-4k-8p-4096M-1000M:10-X
> >     55657.48   -0.2%    55530.27  thresh=1000M/ext4:jsize=8-1dd-4k-8p-4096M-1000M:10-X
> >     38868.18   -1.9%    38146.89  thresh=100M/ext3:jsize=8-10dd-4k-8p-4096M-100M:10-X
> >     46023.21   -0.2%    45908.73  thresh=100M/ext3:jsize=8-1dd-4k-8p-4096M-100M:10-X
> >     42182.84   -1.5%    41556.99  thresh=100M/ext3:jsize=8-2dd-4k-8p-4096M-100M:10-X
> >     45443.23   -0.9%    45038.84  thresh=100M/ext4:jsize=8-10dd-4k-8p-4096M-100M:10-X
> >     53801.15   -0.9%    53315.74  thresh=100M/ext4:jsize=8-1dd-4k-8p-4096M-100M:10-X
> >     52207.05   -0.6%    51913.22  thresh=100M/ext4:jsize=8-2dd-4k-8p-4096M-100M:10-X
> >     33389.88   -3.5%    32226.18  thresh=10M/ext3:jsize=8-10dd-4k-8p-4096M-10M:10-X
> >     45430.23   -3.5%    43846.57  thresh=10M/ext3:jsize=8-1dd-4k-8p-4096M-10M:10-X
> >     44186.72   -4.5%    42185.16  thresh=10M/ext3:jsize=8-2dd-4k-8p-4096M-10M:10-X
> >     36237.34   -3.1%    35128.90  thresh=10M/ext4:jsize=8-10dd-4k-8p-4096M-10M:10-X
> >     54633.30   -2.7%    53135.13  thresh=10M/ext4:jsize=8-1dd-4k-8p-4096M-10M:10-X
> >     50767.63   -1.9%    49800.59  thresh=10M/ext4:jsize=8-2dd-4k-8p-4096M-10M:10-X
> >     49654.38   -4.8%    47274.27  thresh=1M/ext4:jsize=8-1dd-4k-8p-4096M-1M:10-X
> >     45142.01   -5.3%    42745.49  thresh=1M/ext4:jsize=8-2dd-4k-8p-4096M-1M:10-X
> >    914775.42   -1.9%   897057.21  TOTAL io_wkB_s
> These differences look negligible unless thresh <= 10M, when flushing
> becomes rather aggressive, I'd say, and the fact that background
> writeback can switch inodes is thus more noticeable. OTOH thresh <= 10M doesn't
> look like a case which needs optimizing for.

Agreed in principle.

> > wfg@bee /export/writeback% ./compare.rb -v jsize -e io_wkB_s thresh*/*-ioless-full-next-20111102+ thresh*/*-20111102+
> >  3.1.0-ioless-full-next-20111102+  3.1.0-ioless-full-bg-all-next-20111102+
> >  ------------------------  ------------------------
> >     36231.89   -3.8%    34855.10  thresh=1000M/ext3-100dd-4k-8p-4096M-1000M:10-X
> >     41115.07  -12.7%    35886.36  thresh=1000M/ext3-10dd-4k-8p-4096M-1000M:10-X
> >     48025.75  -14.3%    41146.57  thresh=1000M/ext3-1dd-4k-8p-4096M-1000M:10-X
> >     47684.35   -6.4%    44644.30  thresh=1000M/ext4-100dd-4k-8p-4096M-1000M:10-X
> >     54015.86   -4.0%    51851.01  thresh=1000M/ext4-10dd-4k-8p-4096M-1000M:10-X
> >     55320.03   -2.6%    53867.63  thresh=1000M/ext4-1dd-4k-8p-4096M-1000M:10-X
> >     37400.51   +1.6%    38012.57  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
> >     45317.31   -4.5%    43272.16  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
> >     40552.64   +0.8%    40884.60  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
> >     44271.29   -5.6%    41789.76  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
> >     54334.22   -3.5%    52435.69  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
> >     52563.67   -6.1%    49341.84  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
> >     45027.95   -1.0%    44599.37  thresh=10M/ext3-1dd-4k-8p-4096M-10M:10-X
> >     42478.40   +0.3%    42608.48  thresh=10M/ext3-2dd-4k-8p-4096M-10M:10-X
> >     35178.47   -0.2%    35103.56  thresh=10M/ext4-10dd-4k-8p-4096M-10M:10-X
> >     54079.64   -0.5%    53834.85  thresh=10M/ext4-1dd-4k-8p-4096M-10M:10-X
> >     49982.11   -0.4%    49803.44  thresh=10M/ext4-2dd-4k-8p-4096M-10M:10-X
> >    783579.17   -3.8%   753937.28  TOTAL io_wkB_s
> Here I can see some noticeable drops in the realistic thresh=100M case
> (the thresh=1000M case is unrealistic, but it still surprises me that there
> are drops as well). I'll try to reproduce your results so that I can look
> into this more effectively.

OK. I'm trying to bring out the test scripts in a useful way, so as to
make it easier for you to do comparisons/analyses more freely :)

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html