On Thu 03-09-09 00:18:38, Christoph Hellwig wrote: > I think we should add this one ontop: Agreed. Added to the series. I'll now push the whole series to linux-next via my fs tree. Honza > > -- > Subject: [PATCH] fsync: wait for data writeout completion before calling ->fsync > From: Christoph Hellwig <hch@xxxxxx> > > Currenly vfs_fsync(_range) first calls filemap_fdatawrite to write out > the data, the calls into ->fsync to write out the metadata and then finally > calls filemap_fdatawait to wait for the data I/O to complete. What sounds > like a clever micro-optimization actually is nast trap for many filesystems. > > For many modern filesystems i_size or other inode information is only > updated on I/O completion and we need to wait for I/O to finish before > we can write out the metadata. For old fashionen filesystems that > instanciate blocks during the actual write and also update the metadata > at that point it opens up a large window were we could expose uninitialized > blocks after a crash. While a few filesystems that need it already wait > for the I/O to finish inside their ->fsync methods it is rather suboptimal > as it is done under the i_mutex and also always for the whole file instead > of just a part as we could do for O_SYNC handling. > > Here is a small audit of all fsync instances in the tree: > > - spufs_mfc_fsync: > - ps3flash_fsync: > - vol_cdev_fsync: > - printer_fsync: > - fb_deferred_io_fsync: > - bad_file_fsync: > - simple_sync_file: > > don't care - filesystems/drivers do't use the page cache or are > purely in-memory. > > - simple_fsync: > - file_fsync: > - affs_file_fsync: > - fat_file_fsync: > - jfs_fsync: > - ubifs_fsync: > - reiserfs_dir_fsync: > - reiserfs_sync_file: > > never touch pagecache themselves. We need to wait before if we do > not want to expose stale data after an allocation. > > - afs_fsync: > - fuse_fsync_common: > > do the waiting writeback itself in awkward ways, would benefit from > proper semantics > > - block_fsync: > > Does a filemap_write_and_wait on the block device inode. Because we > now have f_mapping that is the same inode we call it on in vfs_fsync. > So just removing it and letting the VFS do the work in one go would > be an improvement. > > - btrfs_sync_file: > - cifs_fsync: > - xfs_file_fsync: > > need the wait first and currently do it themselves. would benefit from > doing it outside i_mutex. > > - coda_fsync: > - ecryptfs_fsync: > - exofs_file_fsync: > - shm_fsync: > > only passes the fsync through to the lower layer > > - ext3_sync_file: > > doesn't seem to care, comments are confusing. > > - ext4_sync_file: > > would need the wait to work correctly for delalloc mode with late > i_size updates. Otherwise the ext3 comment applies. > > > currently implemens it's own writeback and wait in an odd way, > could benefit from doing it properly. > > - gfs2_fsync: > > not needed for journaled data mode, but probably harmless there. > Currently writes back data asynchronously itself. Needs some > major audit. > > - hostfs_fsync: > > just calls fsync/datasync on the host FD. Without the wait before > data might not even be inflight yet if we're unlucky. > > - hpfs_file_fsync: > - ncp_fsync: > > no-ops. Dangerous before and after. > > - jffs2_fsync: > > just calls jffs2_flush_wbuf_gc, not sure how this relates to data. > > - nfs_fsync_dir: > > just increments stats, claims all directory operations are synchronous > > - nfs_file_fsync: > > only writes out data??? Looks very odd. > > - nilfs_sync_file: > > looks like it expects all data done, but not sure from the code > > - ntfs_dir_fsync: > - ntfs_file_fsync: > > appear to do their own data writeback. Very convoluted code. > > - ocfs2_sync_file: > > does it's own data writeback, but no wait. probably needs the wait. > > - smb_fsync: > > according to a comment expects all pages written already, probably needs > the wait before. > > > This patch only changes vfs_fsync_range, removal of the wait in the methods > that have it is left to the filesystem maintainers. Note that most > filesystems really do need an audit for their fsync methods given the > gems found in this very brief audit. > > Signed-off-by: Christoph Hellwig <hch@xxxxxx> > > Index: linux-2.6/fs/sync.c > =================================================================== > --- linux-2.6.orig/fs/sync.c 2009-09-02 15:03:41.073271287 -0300 > +++ linux-2.6/fs/sync.c 2009-09-02 15:04:34.401269249 -0300 > @@ -216,7 +216,7 @@ int vfs_fsync_range(struct file *file, s > goto out; > } > > - ret = filemap_fdatawrite_range(mapping, start, end); > + ret = filemap_write_and_wait_range(mapping, start, end); > > /* > * We need to protect against concurrent writers, which could cause > @@ -228,9 +228,6 @@ int vfs_fsync_range(struct file *file, s > ret = err; > mutex_unlock(&mapping->host->i_mutex); > > - err = filemap_fdatawait_range(mapping, start, end); > - if (!ret) > - ret = err; > out: > return ret; > } -- Jan Kara <jack@xxxxxxx> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html