Rusty Russell wrote:
> > Seems over-zealous.
> > If the recovery_header held a strong checksum of the recovery_data you would
> > not need the first fsync, and as long as you have two places to write recovery
> > data, you don't need the 3rd and 4th syncs.
> > Just:
> >   write_internally_checksummed_recovery_data_and_header_to_unused_log_space()
> >   fsync / msync
> >   overwrite_with_new_data()
> >
> > To recovery you choose the most recent log_space and replay the content.
> > That may be a redundant operation, but that is no loss.
>
> I think you missed a checksum for the new data? Otherwise we can't tell if
> the new data is completely written.

The data checksum can go in the recovery-data block. If there's
enough slack in the log, by the time that recovery-data block is
overwritten, you can be sure that an fsync has been done for that
data (by a later commit).

> But yes, I will steal this scheme for TDB2, thanks!

Take a look at the filesystems. I think ext4 did some optimisations
in this area, and that checksums had to be added anyway due to a
subtle replay-corruption problem that happens when the log is
partially corrupted, and followed by non-corrupt blocks.

Also, you can remove even more fsyncs by adding a bit of slack to the
data space and writing into unused/fresh areas some of the time -
i.e. a bit like btrfs/zfs or anything log-structured, but you don't
have to go all the way with that.

> In practice, it's the first sync which is glacial, the rest are pretty cheap.

The 3rd and 4th fsyncs imply a disk seek each, just because the
preceding writes are to different areas of the disk. Seeks are quite
slow - but not as slow as ext3 fsyncs :-) What do you mean by cheap?
That it's only a couple of seeks, or that you don't see even that?

>
> > Also cannot see the point of msync if you have already performed an fsync,
> > and if there is a point, I would expect you to call msync before
> > fsync... Maybe there is some subtlety there that I am not aware of.
>
> I assume it's this from the msync man page:
>
>     msync() flushes changes made to the in-core copy of a file that was
>     mapped into memory using mmap(2) back to disk. Without use of this
>     call there is no guarantee that changes are written back before
>     munmap(2) is called.

Historically, that means msync() ensures dirty mapping data is written
to the file as if with write(), and that mapping pages are removed or
refreshed to get the effect of read() (possibly a lazy one). It's
more obvious in the early mmap implementations where mappings don't
share pages with the filesystem cache, so msync() has explicit
behaviour.

Like with write(), after calling msync() you would then call fsync()
to ensure the data is flushed to disk.

If you've been calling fsync then msync, I guess that's another fine
example of how these functions are so hard to test that they aren't.

Historically on Linux, msync has been iffy on some architectures, and
I'm still not sure it has the same semantics as other unixes. fsync
as we know has also been iffy, and even now that fsync is tidier it
does not always issue a hardware-level cache commit.

But then historically writable mmap has been iffy on a boatload of
unixes.

> > > It's an implementation detail; barrier has less flexibility because it has
> > > less information about what is required. I'm saying I want to give you as
> > > much information as I can, even if you don't use it yet.
> >
> > Only we know that approach doesn't work.
> > People will learn that they don't need to give the extra information to still
> > achieve the same result - just like they did with ext3 and fsync.
> > Then when we improve the implementation to only provide the guarantees that
> > you asked for, people will complain that they are getting empty files that
> > they didn't expect.
>
> I think that's an oversimplification: IIUC that occurred to people *not*
> using fsync(). They weren't using it because it was too slow.
> Providing a primitive which is as fast or faster and more specific doesn't
> have the same magnitude of social issues.

I agree with Rusty. Let's make it perform well so there is no reason
to deliberately avoid using it, and let's make it say what apps
actually want to request without being way too strong.

And please, if anyone has ideas on how we could make correct use of
these functions *testable* by app authors, I'm all ears. Right now it
is quite difficult - pulling power on hard disks mid-transaction is
not a convenient method :)

> > The abstraction I would like to see is a simple 'barrier' that contains no
> > data and has a filesystem-wide effect.
>
> I think you lack ambition ;)
>
> Thinking about the single-file use case (eg. kvm guest or tdb), isn't that
> suboptimal for md? Since you have to hand your barrier to every device
> whereas a file-wide primitive may theoretically only go to some.

Yes.

Note that database-like programs still need fsync-like behaviour
*sometimes*: The "D" in ACID depends on it, and the "C" in ACID also
depends on it where multiple files are involved which must contain
data consistent with each other after crash/recovery. (Perhaps Samba
depends on this?)

Single-file sync is valuable just like single-file barrier, and so is
the combination.

Since you mentioned ambition, think about multi-file updates. They're
analogous in userspace to MD's barrier/sync requirements in
kernelspace.

One API that supports multi-file update barriers is "long aio-fsync":
something which returns when the data in earlier writes (to one file)
is committed, but does not force the commit to happen more quickly
than normal. Both single-file barriers (like you want for TDB) and
multi-file barriers can be implemented on top of that, but it's much
more difficult to use than an fbarrier() syscall, which is only
suitable for single-file.

But I wonder if there would be many users of fbarrier() who aren't
perfectly capable of using something else if needed.
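
To make the checksummed-log scheme from the top of this mail concrete,
here is a rough Python sketch. It is not TDB code and makes several
assumptions of its own: the two log slots and the data area live in
one file (so a single fsync covers both), slots are a fixed 4096
bytes, and CRC32 stands in for the "strong checksum" (a real
implementation would probably want something stronger). The names
SLOT_SIZE, DATA_START, read_slot, commit and recover are made up for
illustration.

```python
import os
import struct
import zlib

SLOT_SIZE = 4096                  # hypothetical fixed size of each log slot
DATA_START = 2 * SLOT_SIZE        # data area follows the two log slots
HEAD = struct.Struct("<QQI")      # sequence number, data offset, payload length
CRC = struct.Struct("<I")         # CRC32 over header + payload


def read_slot(fd, slot):
    """Return (seq, offset, payload) for a slot, or None if it is not valid."""
    raw = os.pread(fd, SLOT_SIZE, slot * SLOT_SIZE)
    if len(raw) < HEAD.size + CRC.size:
        return None
    seq, offset, length = HEAD.unpack_from(raw)
    end = HEAD.size + length
    if seq == 0 or end + CRC.size > len(raw):
        return None               # never-written slot or nonsense length
    (crc,) = CRC.unpack_from(raw, end)
    if zlib.crc32(raw[:end]) != crc:
        return None               # torn or corrupt record
    return seq, offset, raw[HEAD.size:end]


def commit(fd, seq, offset, payload):
    """Log a self-checking record, sync once, then overwrite data in place."""
    body = HEAD.pack(seq, offset, len(payload)) + payload
    record = body + CRC.pack(zlib.crc32(body))
    if seq < 1 or len(record) > SLOT_SIZE:
        raise ValueError("bad sequence number or payload too large")
    os.pwrite(fd, record, (seq % 2) * SLOT_SIZE)  # alternate between the slots
    os.fsync(fd)                                  # the only sync per commit
    os.pwrite(fd, payload, DATA_START + offset)   # not synced until a later commit


def recover(fd):
    """Replay the newest valid record; return its sequence number (0 if none)."""
    records = [r for r in (read_slot(fd, 0), read_slot(fd, 1)) if r is not None]
    if not records:
        return 0
    seq, offset, payload = max(records, key=lambda r: r[0])
    os.pwrite(fd, payload, DATA_START + offset)   # harmless if already on disk
    os.fsync(fd)
    return seq
```

Because the data checksum travels inside the record, recovery never
has to decide whether the in-place write finished: it just replays the
newest record that passes its checksum. With two slots, the fsync in
commit N+1 also flushes the data written by commit N before record N
can be overwritten, which is the "slack" argument above in miniature.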
-- Jamie