On Tue 01-11-11 21:42:31, Wu Fengguang wrote:
> On Fri, Oct 28, 2011 at 04:31:04AM +0800, Jan Kara wrote:
> > On Thu 27-10-11 14:31:33, Wu Fengguang wrote:
> > > On Fri, Oct 21, 2011 at 06:26:16AM +0800, Jan Kara wrote:
> > > > On Thu 20-10-11 21:39:38, Wu Fengguang wrote:
> > > > > On Thu, Oct 20, 2011 at 08:33:00PM +0800, Wu Fengguang wrote:
> > > > > > On Thu, Oct 20, 2011 at 08:09:09PM +0800, Wu Fengguang wrote:
> > > > > > > Jan,
> > > > > > >
> > > > > > > I tried the below combined patch over the ioless one, and found
> > > > > > > some minor regressions. I studied the thresh=1G/ext3-1dd case in
> > > > > > > particular and found that nr_writeback and the iostat avgrq-sz
> > > > > > > drop from time to time.
> > > > > > >
> > > > > > > I'll try to bisect the changeset.
> > > > >
> > > > > This is interesting: the culprit is found to be patch 1, which is
> > > > > simply
> > > > >
> > > > >         if (work->for_kupdate) {
> > > > >                 oldest_jif = jiffies -
> > > > >                         msecs_to_jiffies(dirty_expire_interval * 10);
> > > > > -               work->older_than_this = &oldest_jif;
> > > > > -       }
> > > > > +       } else if (work->for_background)
> > > > > +               oldest_jif = jiffies;
> > > >
> > > >   Yeah. I had a look into the trace and you can notice that during the
> > > > whole dd run, we were running a single background writeback work (you
> > > > can verify that by work->nr_pages decreasing steadily). Without
> > > > refreshing oldest_jif, we'd write the block device inode for /dev/sda
> > > > (you can identify it by bdi=8:0, ino=0) only once. When refreshing
> > > > oldest_jif, we write it every 5 seconds (kjournald dirties the device
> > > > inode after committing a transaction by dirtying the metadata buffers
> > > > which were just committed and can now be checkpointed either by
> > > > kjournald or the flusher thread). So although the performance is
> > > > slightly reduced, I'd say the behavior is a desired one.
> > > >
> > > >   Also, if you observed the performance over a really long run, the
> > > > difference should get smaller, because eventually kjournald has to
> > > > flush the metadata blocks when the journal fills up and we need to
> > > > free some journal space. At that point flushing is even more expensive
> > > > because we have to do a blocking write during which all transaction
> > > > operations, and thus effectively the whole filesystem, are blocked.
> > >
> > > Jan, I got figures for test case
> > >
> > > ext3-1dd-4k-8p-2941M-1000M:10-3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+
> > >
> > > There is not a single drop of nr_writeback in the longer 1200s run,
> > > which wrote ~60GB of data.
> >
> >   I did some calculations. The default journal size for a filesystem of
> > your size is 128 MB, which allows recording of around 128 GB of data. So
> > your test probably didn't hit the point where the journal is recycled
> > yet. An easy way to make sure the journal gets recycled is to set its
> > size to a lower value when creating the filesystem with
> >
> >   mke2fs -J size=8
>
> I tried the "-J size=8" and got similar interesting results for
> ext3/4, before/after this change:
>
>         if (work->for_kupdate) {
>                 oldest_jif = jiffies -
>                         msecs_to_jiffies(dirty_expire_interval * 10);
> -               work->older_than_this = &oldest_jif;
> -       }
> +       } else if (work->for_background)
> +               oldest_jif = jiffies;
>
> So I only attach the graphs for one case:
>
> ext4-1dd-4k-8p-2941M-1000M:10-3.1.0-ioless-full-next-20111025+
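For illustration of the effect described above, here is a stand-alone
user-space sketch (a toy model, not the fs/fs-writeback.c code; all names
are made up). It mimics an expiry cutoff that is either frozen when the
background work starts or refreshed on every pass of the loop: with the
frozen cutoff, an inode that kjournald re-dirties every ~5 seconds stops
qualifying once it is first re-dirtied, while the refreshed cutoff keeps
selecting it.

	#include <stdbool.h>
	#include <stdio.h>

	struct fake_inode {
		long dirtied_when;	/* "time" the inode was last dirtied */
	};

	/* Eligible for writeback if dirtied at or before the cutoff. */
	static bool inode_expired(const struct fake_inode *inode,
				  long older_than_this)
	{
		return inode->dirtied_when <= older_than_this;
	}

	int main(void)
	{
		struct fake_inode bdev_inode = { .dirtied_when = 0 };
		long frozen_cutoff = 0;	/* taken once, when the work starts */
		long now;

		for (now = 1; now <= 30; now++) {
			/* kjournald re-dirties the inode every ~5 "seconds" */
			if (now % 5 == 0)
				bdev_inode.dirtied_when = now;

			long refreshed_cutoff = now;	/* refreshed each pass */

			printf("t=%2ld  frozen: %-3s  refreshed: %s\n", now,
			       inode_expired(&bdev_inode, frozen_cutoff) ? "yes" : "no",
			       inode_expired(&bdev_inode, refreshed_cutoff) ? "yes" : "no");
		}
		return 0;
	}

With the frozen cutoff the inode prints "yes" only until it is first
re-dirtied at t=5, matching the single write of the block device inode
noted above; the refreshed cutoff prints "yes" on every pass.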
> Two of the graphs are very interesting.
>
> balance_dirty_pages-pause.png shows increasingly large negative pause
> times, which indicates large delays inside some of ext4's routines.

  Likely we are hanging waiting for transaction start. An 8 MB journal puts
rather big pressure on journal space, so we end up waiting on kjournald a
lot. But I'm not sure why the wait times would increase on a large scale;
with ext4 it's harder to estimate used journal space because it uses
extents, so the amount of metadata written depends on fragmentation. If you
could post the ext3 graphs, maybe I could make some sense of it...

> And iostat-util.png shows very large CPU utilization... Oh well, the
> lock_stat output has rcu_torture_timer at the top. I'd better retest
> without the rcu torture test option...

  Yes, I guess this might be just a debugging artefact.

> > Then at latest after writing 8 GB the effect of journal recycling should
> > be visible (I suggest writing at least 16 GB or so, so that we can see
> > some pattern). Also note that without the patch altering background
> > writeback, kjournald will do all the writeback of the metadata, and
> > kjournald works with buffer heads. Thus the IO it does is *not*
> > accounted in mm statistics. You will observe its effects only as a
> > sudden increase in await or svctm, because the disk got busy with IO you
> > don't see. Also, secondarily, you could probably observe it as a hiccup
> > in the number of dirtied/written pages.
>
> Ah, good to know. That could explain the drops in IO size.
>
> iostat should still be reporting the journal IO, is it?
  Yes.

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
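As a back-of-envelope check on the recycling threshold mentioned above,
using only the figures quoted in this thread (128 MB journal -> ~128 GB of
data, i.e. a ~1024x ratio), the small program below scales that down to the
8 MB journal; it is a rough estimate, not a measurement.

	#include <stdio.h>

	int main(void)
	{
		const double journal_mb = 8.0;			/* mke2fs -J size=8 */
		const double ratio = 128.0 * 1024 / 128.0;	/* ~128 GB per 128 MB */
		double first_wrap_gb = journal_mb * ratio / 1024.0;

		printf("journal %.0f MB -> first journal wrap after ~%.0f GB written\n",
		       journal_mb, first_wrap_gb);
		printf("suggested run length: >= %.0f GB to see the pattern repeat\n",
		       2 * first_wrap_gb);
		return 0;
	}

This prints ~8 GB for the first wrap and suggests a run of at least 16 GB,
which is consistent with the numbers given in the quoted text.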