On Fri, Aug 14, 2009 at 11:48:31PM +0200, Jan Kara wrote:
>   Hi,
>
>   I was looking at generic_osync_inode() and found out it's kind of
> inconsistent with what we have in fsync() path and does not work in all
> cases e.g. on ext3 / ext4. The problem is that filesystem never actually
> gets to know that it should sync all metadata needed to reach the data -
> generic_osync_inode() only does sync_mapping_buffers() but e.g. ext3 / ext4
> don't track metadata buffers there. Then it does write_inode_now() which
> would actually flush the journal, but it does so only in case inode is
> I_DIRTY_DATASYNC... So there are cases where we sync the data but leave
> metadata uncommitted.

Yeah, it's not very useful.  That's the reason why XFS doesn't rely on it
but rather does a manual log force for the metadata.

>   What I'd imagine is that generic_osync_inode() would be just like
> fdatasync call, only we'd have to add a possibility to avoid fdatawrite /
> fdatawait as some callers submit / wait for data themselves. That would
> nicely unify those syncing paths.
>   The only small problem is with an interface since ->fsync() callback
> takes preferably struct file * and at least struct dentry *, while
> generic_osync_inode takes just inode. Most of the callers actually have
> a struct file * pointer but sync_page_range[_nolock]() do not, so that
> would have to be solved somehow.

Note that the current way ->fsync works is also rather problematic.  For
one thing, all filesystems that touch metadata on data I/O completion
(XFS, btrfs, ext4 and probably more) really want to first write out the
data _and_ wait for it, and only then sync the inode.  Right now every
filesystem has to do that itself, and do it under i_mutex or
drop/reacquire it, which is quite stupid.

The second issue is that we pass a file, which is required for some
filesystems (e.g. NFS) but which may be NULL when coming from NFSD.  I'll
need to look into fixing NFSD to always pass a file here.

Now back to the original generic_osync_inode / O_SYNC handling problem:

All callers of sync_page_range actually do have a file pointer, and all of
them are called from generic or filesystem-specific write code, so passing
the file pointer to it is no problem at all.  sync_page_range_nolock just
has a single caller in fat, which does not have a file pointer.  The right
thing here is IMHO:

 (1) open-code sync_page_range_nolock in fat and just get rid of it as a
     generic helper
 (2) replace sync_page_range with a generic_write_sync or similar that
     does the range writeouts + a call to ->fsync.  Bonus points for also
     moving the O_SYNC / IS_SYNC checks into that helper.
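
To make the shape of such a helper concrete, here is a rough userspace sketch. It is not kernel code and not a proposed patch: struct mock_file, mock_write_range(), mock_wait_range(), mock_fsync() and generic_write_sync_sketch() are all hypothetical stand-ins invented for illustration, and passing datasync=1 is an assumption based on the "just like fdatasync" suggestion in the quoted mail. It only shows the intended ordering: check O_SYNC / IS_SYNC in the helper, write out the range, wait for it, and only then call ->fsync.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are real kernel types. */
enum mock_event { EV_WRITE = 1, EV_WAIT, EV_FSYNC };

struct mock_file;

struct mock_file_ops {
	int (*fsync)(struct mock_file *file, int datasync);
};

struct mock_file {
	int o_sync;		/* models file->f_flags & O_SYNC */
	int inode_is_sync;	/* models IS_SYNC(inode) */
	const struct mock_file_ops *ops;
	enum mock_event log[8];	/* order in which the steps ran */
	size_t nlog;
};

static void mock_record(struct mock_file *file, enum mock_event ev)
{
	if (file->nlog < sizeof(file->log) / sizeof(file->log[0]))
		file->log[file->nlog++] = ev;
}

/* Stand-in for starting writeout of the byte range [start, end]. */
static int mock_write_range(struct mock_file *file, long long start,
			    long long end)
{
	(void)start;
	(void)end;
	mock_record(file, EV_WRITE);
	return 0;
}

/* Stand-in for waiting until that writeout has completed. */
static int mock_wait_range(struct mock_file *file, long long start,
			   long long end)
{
	(void)start;
	(void)end;
	mock_record(file, EV_WAIT);
	return 0;
}

/* Stand-in for a filesystem's ->fsync method. */
static int mock_fsync(struct mock_file *file, int datasync)
{
	(void)datasync;
	mock_record(file, EV_FSYNC);
	return 0;
}

static const struct mock_file_ops mock_ops = { .fsync = mock_fsync };

/*
 * Sketch of the proposed helper: the O_SYNC / IS_SYNC check lives here
 * rather than in every caller, and the data is written and waited for
 * before the filesystem is asked to sync its metadata.
 */
static int generic_write_sync_sketch(struct mock_file *file, long long pos,
				     long long count)
{
	int err;

	assert(file != NULL && file->ops != NULL && file->ops->fsync != NULL);

	if (!file->o_sync && !file->inode_is_sync)
		return 0;	/* not a synchronous write, nothing to do */
	if (count <= 0)
		return 0;	/* nothing was written */

	err = mock_write_range(file, pos, pos + count - 1);
	if (err)
		return err;
	err = mock_wait_range(file, pos, pos + count - 1);
	if (err)
		return err;

	/* datasync=1 is an assumption, see the note above the code. */
	return file->ops->fsync(file, 1);
}
```

A write path would call the helper once after copying the data, with the file pointer it already has. A caller that submits and waits for its data itself would need a variant that skips the first two steps, which this sketch does not model.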