On Thu, Jul 28, 2011 at 07:42:03PM +0200, Jan Kara wrote:
> > be do a blocking sync_inodes_sb, but then a non-blocking ->sync_fs?
> I tried to explain this in the changelog but it seems I failed.
> wakeup_flusher_threads(0) starts completely asynchronous writeback of all
> bdis. We could do an iteration of sb_sync_fs with wait==0 over all
> superblocks just after wakeup_flusher_threads(0) returns but at this
> moment, almost nothing is written yet because flusher threads have just
> started doing their work. So what useful work can ->sync_fs(0) do in such
> case? E.g. ocfs2 could start processing orphan list but for example
> starting quota writeback as done by XFS would be really fruitless as quota
> is going to change heavily as a result of writeback I expect. Similarly
> ext3 or ext4 cannot really do useful work while most inodes are not yet
> written.

The way XFS currently starts quota writeback actually seems pretty useless
to me already.  I don't think removing it will have a performance impact,
but I will have to benchmark it carefully.  Remember that even right now
the chance that all inodes are actually written is fairly low on a busy
system, so we might just create a lot of duplicate work by the two sync_fs
passes.

> It would be nice to call ->sync_fs with wait == 0 when flusher thread is
> done with the bdi but there's no easy way to do that. We could use the
> completion for this but we'd have to track these for every bdi which isn't
> very nice. That's why I chose to synchronize with flusher threads using
> the synchronous inode writeback - when that is finished we know flusher
> threads are done and so calling ->sync_fs() can be useful. So do you think
> calling ->sync_fs() with wait == 0 that late is a problem?

That's far too many subtle changes for one patch.  The current sync
sequence in pseudo-code is:

	wakeup_flusher_threads()

	for_each_sb {
		writeback_inodes_sb();
		sync_fs(nowait);
		__sync_blockdev(nowait);
	}

	for_each_sb {
		sync_inodes_sb();
		sync_fs(wait);
		__sync_blockdev(nowait);
	}

and currently your patch changes it to:

	wakeup_flusher_threads()

	for_each_sb {
		sync_inodes_sb();
		sync_fs(nowait);
	}

	for_each_bdev
		__sync_blockdev(nowait);
	for_each_sb
		sync_fs(wait);
	for_each_bdev
		__sync_blockdev(wait);

Let's keep things bisectable, with one behaviour change and one goal at a
time.  That is, the first patch in the sequence splits the loops and moves
to:

	wakeup_flusher_threads()

	for_each_sb
		writeback_inodes_sb();
	for_each_sb
		sync_fs(nowait);
	for_each_sb
		__sync_blockdev(nowait);

	for_each_sb
		sync_inodes_sb();
	for_each_sb
		sync_fs(wait);
	for_each_sb
		__sync_blockdev(nowait);

Then as next steps add one patch that moves to iterating over the block
devices for the __sync_blockdev pass, and one that drops the
writeback_inodes_sb call, at which point we'll have:

	wakeup_flusher_threads()

	for_each_sb
		sync_fs(nowait);
	for_each_bdev
		__sync_blockdev(nowait);

	for_each_sb
		sync_inodes_sb();
	for_each_sb
		sync_fs(wait);
	for_each_bdev
		__sync_blockdev(nowait);

And now we can think about optimizations.  Randomly changing the order as
done right now doesn't seem too helpful.  My theory would be that we could
simply drop the nowait version of sync_fs entirely, as it doesn't do much
useful work anyway and whatever it does do gets invalidated by the inode
writeback later on, but we'd really have to validate that theory with
benchmark runs on the filesystems where people care about performance.
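
To make that first bisectable step concrete, here is a rough, untested
sketch of how the split loops could be expressed in fs/sync.c with
iterate_supers() callbacks.  The *_one_sb helper names are made up for
illustration, the noop_backing_dev_info check from __sync_filesystem is
only kept in the inode writeback helpers to keep the sketch short, and
the exact signatures of wakeup_flusher_threads() and writeback_inodes_sb()
vary between kernel versions:

/* Illustrative sketch only -- mirrors the split-loop sequence above. */

static void writeback_inodes_one_sb(struct super_block *sb, void *arg)
{
        if (sb->s_bdi != &noop_backing_dev_info)
                writeback_inodes_sb(sb);        /* WB_SYNC_NONE writeback */
}

static void sync_inodes_one_sb(struct super_block *sb, void *arg)
{
        if (sb->s_bdi != &noop_backing_dev_info)
                sync_inodes_sb(sb);             /* WB_SYNC_ALL, waits for inodes */
}

static void sync_fs_one_sb(struct super_block *sb, void *arg)
{
        if (sb->s_op->sync_fs)
                sb->s_op->sync_fs(sb, *(int *)arg);
}

static void sync_blockdev_one_sb(struct super_block *sb, void *arg)
{
        /* __sync_blockdev() copes with a NULL s_bdev for non-bdev filesystems */
        __sync_blockdev(sb->s_bdev, *(int *)arg);
}

SYSCALL_DEFINE0(sync)
{
        int nowait = 0, wait = 1;

        wakeup_flusher_threads(0);

        iterate_supers(writeback_inodes_one_sb, NULL);
        iterate_supers(sync_fs_one_sb, &nowait);
        iterate_supers(sync_blockdev_one_sb, &nowait);

        iterate_supers(sync_inodes_one_sb, NULL);
        iterate_supers(sync_fs_one_sb, &wait);
        iterate_supers(sync_blockdev_one_sb, &nowait);  /* as in the sequence above */

        if (unlikely(laptop_mode))
                laptop_sync_completion();
        return 0;
}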