On Tue, Aug 17, 2010 at 09:20:37AM -0500, Anthony Liguori wrote:
> On 08/17/2010 08:07 AM, Christoph Hellwig wrote:
> >>The point is that we don't want to flush the disk write cache.  The
> >>intention of writethrough is not to make the disk cache writethrough
> >>but to treat the host's cache as writethrough.
> >
> >We need to make sure data is not in the disk write cache if want to
> >provide data integrity.
>
> When the guest explicitly flushes the emulated disk's write cache.
> Not on every single write completion.

That depends on the cache= mode.  For cache=none and cache=writeback we
present a write-back cache to the guest, and the guest does explicit
cache flushes.  For cache=writethrough we present a writethrough cache
to the guest, and we need to make sure the data has actually hit the
disk before returning I/O completion to the guest.

> > It has nothing to do with the qemu caching
> >mode - for data=writeback or none it's commited as part of the fdatasync
> >call, and for data=writethrough it's commited as part of the O_SYNC
> >write.  Note that both these path end up calling the filesystems ->fsync
> >method which is what's require to make writes stable.  That's exactly
> >what is missing out in sync_file_range, and that's why that API is not
> >useful at all for data integrity operations.
>
> For normal writes from a guest, we don't need to follow the write
> with an fsync().  We should only need to issue an fsync() given an
> explicit flush from the guest.

Define normal writes.  For cache=none and cache=writeback we don't have
to sync after each write, and instead call fsync()/fdatasync()
explicitly when we get a cache flush from the guest.  For
cache=writethrough we guarantee that the data has made it to disk, and
we implement this using O_DSYNC/O_SYNC when opening the file.  That
tells the operating system not to return from the write until the data
has hit the disk.  On Linux this is implemented internally as a
range-fsync/fdatasync after the actual write.
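
To make the two cases concrete, here is a rough sketch of the two write
paths.  It is not qemu code: the function names, the guest_flush
parameter and the file names are made up for illustration, and it
assumes a Linux host where os.fdatasync() is available.  It also
ignores short writes, which real code would have to loop over.

```python
import os
import tempfile


def write_writethrough(path, data):
    """cache=writethrough style: every write is stable when it returns.

    O_DSYNC makes write() block until the data has reached stable
    storage; fall back to O_SYNC where O_DSYNC is not defined.
    """
    flags = os.O_WRONLY | os.O_CREAT | getattr(os, "O_DSYNC", os.O_SYNC)
    fd = os.open(path, flags, 0o600)
    try:
        return os.write(fd, data)
    finally:
        os.close(fd)


def write_then_flush(path, data, guest_flush):
    """cache=none/writeback style: plain write, sync only on a flush.

    guest_flush stands in for the guest issuing an explicit cache
    flush; only then do we pay for fdatasync().
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        written = os.write(fd, data)
        if guest_flush:
            os.fdatasync(fd)
        return written
    finally:
        os.close(fd)


def demo():
    """Write the same payload through both paths and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        wt = os.path.join(tmp, "writethrough.img")
        wb = os.path.join(tmp, "writeback.img")
        write_writethrough(wt, b"guest data")
        write_then_flush(wb, b"guest data", guest_flush=True)
        with open(wt, "rb") as f_wt, open(wb, "rb") as f_wb:
            return f_wt.read(), f_wb.read()
```

In both paths the data ends up stable; the only difference is whether
the cost is paid on every write or only when the guest asks for a
flush.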

> fsync() being slow is orthogonal to my point.  I don't see why we
> need to do an fsync() on *every* write.  It should only be necessary
> when a guest injects an actual barrier.

See above.