Hi Christoph:

On Thu, Jul 28, 2011 at 1:37 PM, Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> On Thu, Jul 28, 2011 at 07:42:03PM +0200, Jan Kara wrote:
>> > be do a blocking sync_inodes_sb, but then a non-blocking ->sync_fs?
>>   I tried to explain this in the changelog but it seems I failed.
>> wakeup_flusher_threads(0) starts completely asynchronous writeback of all
>> bdis. We could do an iteration of sb_sync_fs with wait==0 over all
>> superblocks just after wakeup_flusher_threads(0) returns, but at this
>> moment almost nothing is written yet because flusher threads have just
>> started doing their work. So what useful work can ->sync_fs(0) do in such
>> a case? E.g. ocfs2 could start processing the orphan list, but for example
>> starting quota writeback as done by XFS would be really fruitless, as quota
>> is going to change heavily as a result of writeback, I expect. Similarly,
>> ext3 or ext4 cannot really do useful work while most inodes are not yet
>> written.
>
> The way xfs currently starts quota writeback actually seems pretty
> useless to me already.  I don't think removing it will have a
> performance impact, but I will have to benchmark it carefully.  Remember
> that even right now the chances that all inodes are actually written
> are fairly low on a busy system, so we might just create a lot of
> duplicate work by the two sync_fs passes.
>
>> It would be nice to call ->sync_fs with wait == 0 when the flusher thread is
>> done with the bdi, but there's no easy way to do that. We could use the
>> completion for this, but we'd have to track these for every bdi, which isn't
>> very nice. That's why I chose to synchronize with flusher threads using
>> the synchronous inode writeback - when that is finished we know flusher
>> threads are done and so calling ->sync_fs() can be useful. So do you think
>> calling ->sync_fs() with wait == 0 that late is a problem?
>
> It's far too many subtle changes for one patch.  The current sync
> sequence in pseudo-code is:
>
> wakeup_flusher_threads()
> for_each_sb {
>     writeback_inodes_sb();
>     sync_fs(nowait);
>     __sync_blockdev(nowait);
> }
> for_each_sb {
>     sync_inodes_sb();
>     sync_fs(wait);
>     __sync_blockdev(nowait);

Do you mean "__sync_blockdev(wait)" here?  'Cause that's what
__sync_filesystem() does if wait=1.  I'd have just chalked it up to a
typo, but you repeated it everywhere below too, and wanted to see if
I'm missing something.

Curt

> }
>
> and currently your patch changes it to:
>
> wakeup_flusher_threads()
> for_each_sb {
>     sync_inodes_sb();
>     sync_fs(nowait);
> }
> for_each_bdev
>     __sync_blockdev(nowait);
> for_each_sb
>     sync_fs(wait);
> for_each_bdev
>     __sync_blockdev(wait);
>
> Let's do things bisectably, with one behaviour change and one goal at a
> time.
>
> That is, the first patch in the sequence splits the loops and moves to:
>
> wakeup_flusher_threads()
> for_each_sb
>     writeback_inodes_sb();
> for_each_sb
>     sync_fs(nowait);
> for_each_sb
>     __sync_blockdev(nowait);
> for_each_sb
>     sync_inodes_sb();
> for_each_sb
>     sync_fs(wait);
> for_each_sb
>     __sync_blockdev(nowait);
>
> Then, as next steps, add one patch that moves to iterating over the block
> devices for the __sync_blockdev pass, and one that drops the
> writeback_inodes_sb call, at which point we'll have:
>
> wakeup_flusher_threads()
> for_each_sb
>     sync_fs(nowait);
> for_each_bdev
>     __sync_blockdev(nowait);
> for_each_sb
>     sync_inodes_sb();
> for_each_sb
>     sync_fs(wait);
> for_each_bdev
>     __sync_blockdev(nowait);
>
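For concreteness, I think the end state you describe would look something
like this in sys_sync() -- just a sketch, and sync_all_bdevs() below is a
made-up stand-in for whatever block device iterator we'd have to add:

    /* Callback for iterate_supers(): call ->sync_fs() with the given wait flag. */
    static void sync_fs_one_sb(struct super_block *sb, void *arg)
    {
            if (sb->s_op->sync_fs)
                    sb->s_op->sync_fs(sb, *(int *)arg);
    }

    /* Callback for iterate_supers(): blocking inode writeback for one sb. */
    static void sync_inodes_one_sb(struct super_block *sb, void *arg)
    {
            sync_inodes_sb(sb);
    }

    SYSCALL_DEFINE0(sync)
    {
            int nowait = 0, wait = 1;

            wakeup_flusher_threads(0);                 /* kick async writeback on all bdis */
            iterate_supers(sync_fs_one_sb, &nowait);   /* ->sync_fs(sb, 0) */
            sync_all_bdevs(nowait);                    /* hypothetical: __sync_blockdev() on each bdev */
            iterate_supers(sync_inodes_one_sb, NULL);  /* blocking inode writeback */
            iterate_supers(sync_fs_one_sb, &wait);     /* ->sync_fs(sb, 1) */
            sync_all_bdevs(nowait);                    /* nowait as written above -- or wait? */
            return 0;
    }
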
> And now we can think about optimizations.  Randomly changing the order
> as done right now doesn't seem too helpful.  My theory would be that
> we could simply drop the nowait versions of sync_fs entirely, as it
> doesn't do a lot of work that gets invalidated by the inode writeback
> later on, but we'd really have to validate that theory by benchmark
> runs on the filesystems where people care about performance.
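
P.S.  For reference, here's roughly what __sync_filesystem() looks like
today -- condensed from memory, with the bdi and quota checks left out, so
take the details with a grain of salt:

    /* Condensed sketch of __sync_filesystem() from fs/sync.c, not verbatim. */
    static int __sync_filesystem(struct super_block *sb, int wait)
    {
            if (wait)
                    sync_inodes_sb(sb);        /* blocking inode writeback */
            else
                    writeback_inodes_sb(sb);   /* just start writeback */

            if (sb->s_op->sync_fs)
                    sb->s_op->sync_fs(sb, wait);

            /* the same wait flag is passed straight through to the blockdev flush */
            return __sync_blockdev(sb->s_bdev, wait);
    }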