On Thu, Jul 28, 2011 at 07:42:03PM +0200, Jan Kara wrote:
> > be do a blocking sync_inodes_sb, but then a non-blocking ->sync_fs?
> I tried to explain this in the changelog but it seems I failed.
> wakeup_flusher_threads(0) starts completely asynchronous writeback of all
> bdis. We could do an iteration of sb_sync_fs with wait==0 over all
> superblocks just after wakeup_flusher_threads(0) returns but at this
> moment, almost nothing is written yet because flusher threads have just
> started doing their work. So what useful work can ->sync_fs(0) do in such
> case? E.g. ocfs2 could start processing orphan list but for example
> starting quota writeback as done by XFS would be really fruitless as quota
> is going to change heavily as a result of writeback I expect. Similarly
> ext3 or ext4 cannot really do useful work while most inodes are not yet
> written.

The way XFS currently starts quota writeback actually seems pretty useless
to me already.  I don't think removing it will have a performance impact,
but I will have to benchmark it carefully.  Remember that even right now
the chance that all inodes are actually written is fairly low on a busy
system, so we might just create a lot of duplicate work by the two sync_fs
passes.

> It would be nice to call ->sync_fs with wait == 0 when flusher thread is
> done with the bdi but there's no easy way to do that. We could use the
> completion for this but we'd have to track these for every bdi which isn't
> very nice. That's why I chose to synchronize with flusher threads using
> the synchronous inode writeback - when that is finished we know flusher
> threads are done and so calling ->sync_fs() can be useful. So do you think
> calling ->sync_fs() with wait == 0 that late is a problem?

That's far too many subtle changes for one patch.  The current sync
sequence in pseudo-code is:

	wakeup_flusher_threads()

	for_each_sb {
		writeback_inodes_sb();
		sync_fs(nowait);
		__sync_blockdev(nowait);
	}

	for_each_sb {
		sync_inodes_sb();
		sync_fs(wait);
		__sync_blockdev(nowait);
	}

and currently your patch changes it to:

	wakeup_flusher_threads()

	for_each_sb {
		sync_inodes_sb();
		sync_fs(nowait);
	}

	for_each_bdev
		__sync_blockdev(nowait);
	for_each_sb
		sync_fs(wait);
	for_each_bdev
		__sync_blockdev(wait);

Let's keep things bisectable, with one behaviour change and one goal at a
time.  That is, the first patch in the sequence splits the loops and moves
to:

	wakeup_flusher_threads()

	for_each_sb
		writeback_inodes_sb();
	for_each_sb
		sync_fs(nowait);
	for_each_sb
		__sync_blockdev(nowait);

	for_each_sb
		sync_inodes_sb();
	for_each_sb
		sync_fs(wait);
	for_each_sb
		__sync_blockdev(nowait);

Then as next steps add one patch that moves to iterating over the block
devices for the __sync_blockdev pass, and one that drops the
writeback_inodes_sb call, at which point we'll have:

	wakeup_flusher_threads()

	for_each_sb
		sync_fs(nowait);
	for_each_bdev
		__sync_blockdev(nowait);

	for_each_sb
		sync_inodes_sb();
	for_each_sb
		sync_fs(wait);
	for_each_bdev
		__sync_blockdev(nowait);

And now we can think about optimizations.  Randomly changing the order as
done right now doesn't seem too helpful.  My theory would be that we could
simply drop the nowait version of sync_fs entirely, as it doesn't do much
useful work anyway and whatever it does do gets invalidated by the inode
writeback later on, but we'd really have to validate that theory with
benchmark runs on the filesystems where people care about performance.
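
To make that first bisectable step concrete, here is a rough, untested
sketch of how the split loops could be expressed in fs/sync.c with
iterate_supers() callbacks.  The *_one_sb helper names are made up for
illustration, the noop_backing_dev_info check from __sync_filesystem is
only kept in the inode writeback helpers to keep the sketch short, and
the exact signatures of wakeup_flusher_threads() and writeback_inodes_sb()
vary between kernel versions:

/* Illustrative sketch only -- mirrors the split-loop sequence above. */

static void writeback_inodes_one_sb(struct super_block *sb, void *arg)
{
        if (sb->s_bdi != &noop_backing_dev_info)
                writeback_inodes_sb(sb);        /* WB_SYNC_NONE writeback */
}

static void sync_inodes_one_sb(struct super_block *sb, void *arg)
{
        if (sb->s_bdi != &noop_backing_dev_info)
                sync_inodes_sb(sb);             /* WB_SYNC_ALL, waits for inodes */
}

static void sync_fs_one_sb(struct super_block *sb, void *arg)
{
        if (sb->s_op->sync_fs)
                sb->s_op->sync_fs(sb, *(int *)arg);
}

static void sync_blockdev_one_sb(struct super_block *sb, void *arg)
{
        /* __sync_blockdev() copes with a NULL s_bdev for non-bdev filesystems */
        __sync_blockdev(sb->s_bdev, *(int *)arg);
}

SYSCALL_DEFINE0(sync)
{
        int nowait = 0, wait = 1;

        wakeup_flusher_threads(0);

        iterate_supers(writeback_inodes_one_sb, NULL);
        iterate_supers(sync_fs_one_sb, &nowait);
        iterate_supers(sync_blockdev_one_sb, &nowait);

        iterate_supers(sync_inodes_one_sb, NULL);
        iterate_supers(sync_fs_one_sb, &wait);
        iterate_supers(sync_blockdev_one_sb, &nowait);  /* as in the sequence above */

        if (unlikely(laptop_mode))
                laptop_sync_completion();
        return 0;
}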