On Wed, 5 May 2010 03:33:43 pm Neil Brown wrote: > On Wed, 5 May 2010 14:28:41 +0930 > Rusty Russell <rusty@xxxxxxxxxxxxxxx> wrote: > > > On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote: > > > Jens Axboe wrote: > > > > On Tue, May 04 2010, Rusty Russell wrote: > > > > > ISTR someone mentioning a desire for such an API years ago, so CC'ing the > > > > > usual I/O suspects... > > > > > > > > It would be nice to have a more fuller API for this, but the reality is > > > > that only the flush approach is really workable. Even just strict > > > > ordering of requests could only be supported on SCSI, and even there the > > > > kernel still lacks proper guarantees on error handling to prevent > > > > reordering there. > > > > > > There's a few I/O scheduling differences that might be useful: > > > > > > 1. The I/O scheduler could freely move WRITEs before a FLUSH but not > > > before a BARRIER. That might be useful for time-critical WRITEs, > > > and those issued by high I/O priority. > > > > This is only because noone actually wants flushes or barriers, though > > I/O people seem to only offer that. We really want "<these writes> must > > occur before <this write>". That offers maximum choice to the I/O subsystem > > and potentially to smart (virtual?) disks. > > > > > 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is > > > only for data belonging to a particular file (e.g. fdatasync with > > > no file size change, even on btrfs if O_DIRECT was used for the > > > writes being committed). That would entail tagging FLUSHes and > > > WRITEs with a fs-specific identifier (such as inode number), opaque > > > to the scheduler which only checks equality. > > > > This is closer. In userspace I'd be happy with a "all prior writes to this > > struct file before all future writes". Even if the original guarantees were > > stronger (ie. inode basis). We currently implement transactions using 4 fsync > > /msync pairs. > > > > write_recovery_data(fd); > > fsync(fd); > > msync(mmap); > > write_recovery_header(fd); > > fsync(fd); > > msync(mmap); > > overwrite_with_new_data(fd); > > fsync(fd); > > msync(mmap); > > remove_recovery_header(fd); > > fsync(fd); > > msync(mmap); > > Seems over-zealous. > If the recovery_header held a strong checksum of the recovery_data you would > not need the first fsync, and as long as you have two places to write recovery > data, you don't need the 3rd and 4th syncs. > Just: > write_internally_checksummed_recovery_data_and_header_to_unused_log_space() > fsync / msync > overwrite_with_new_data() > > To recovery you choose the most recent log_space and replay the content. > That may be a redundant operation, but that is no loss. I think you missed a checksum for the new data? Otherwise we can't tell if the new data is completely written. But yes, I will steal this scheme for TDB2, thanks! In practice, it's the first sync which is glacial, the rest are pretty cheap. > Also cannot see the point of msync if you have already performed an fsync, > and if there is a point, I would expect you to call msync before > fsync... Maybe there is some subtlety there that I am not aware of. I assume it's this from the msync man page: msync() flushes changes made to the in-core copy of a file that was mapped into memory using mmap(2) back to disk. Without use of this call there is no guarantee that changes are written back before mun‐ map(2) is called. > > It's an implementation detail; barrier has less flexibility because it has > > less information about what is required. I'm saying I want to give you as > > much information as I can, even if you don't use it yet. > > Only we know that approach doesn't work. > People will learn that they don't need to give the extra information to still > achieve the same result - just like they did with ext3 and fsync. > Then when we improve the implementation to only provide the guarantees that > you asked for, people will complain that they are getting empty files that > they didn't expect. I think that's an oversimplification: IIUC that occurred to people *not* using fsync(). They weren't using it because it was too slow. Providing a primitive which is as fast or faster and more specific doesn't have the same magnitude of social issues. And we can't write userspace interfaces for idiots only. > The abstraction I would like to see is a simple 'barrier' that contains no > data and has a filesystem-wide effect. I think you lack ambition ;) Thinking about the single-file use case (eg. kvm guest or tdb), isn't that suboptimal for md? Since you have to hand your barrier to every device whereas a file-wide primitive may theoretically only go to some. Cheers, Rusty. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html