On Sun, Dec 15, 2002 at 08:59:25PM -0800, Ben Escoto wrote:
> Hi, can someone tell me whether applications can expect the write
> requests they make to be executed in order?  For instance, suppose an
> application requests that a file be deleted, and then that another
> file be moved to an unrelated place.  Will these events always happen
> in that order?  Or to put it another way, if something unexpected
> happens in the meantime (say the computer crashes), is it guaranteed
> that we can't end up with just the second action having been
> performed (i.e., that the second action was done first and the crash
> happened just after that)?
>
> How about if a file is written (and closed) and then a different
> file is moved?  Is it possible that the second file gets moved before
> all the data is written?
>
> Does this depend on the file system, or do all/most filesystems
> behave the same way?  Sorry if this is common knowledge, but I
> googled for a while and couldn't find anything.  If it matters, I am
> trying to make sure a backup program (see
> http://rdiff-backup.stanford.edu) doesn't lose data.  Thanks for any
> information.

In general, applications can't expect anything about data write ordering unless they use fsync(), which guarantees that everything written up to that point is flushed out to disk.

With filesystems that do not have journalling support, there are in general no guarantees at all about whether file deletes, renames, etc. will be committed to disk in any particular order, or whether they will happen at all in the event of a crash. Nor is there any guarantee that data blocks will be written even if the filesystem metadata changes are made (i.e., you can write a file, then rename it, and the rename might take, but the data blocks might still not be written).

In general, if you want to guarantee that something is flushed out to disk, use fsync(). And if you care about ordering, then the application may need to do its own application-level journaling (that's what most databases do, for example).

Some filesystems will give you better guarantees. For example, filesystems that provide journalling will generally guarantee that metadata operations are committed to the filesystem in order. However, many journaled filesystems will not guarantee anything about the data blocks; the purpose of the journal in many cases is simply to avoid long fsck runs after a crash, not to ensure application data integrity --- that's the job of the application and fsync().

With ext3, this behavior can be controlled using the mount options "data=journal", "data=ordered", and "data=writeback". In data=journal, all data is committed into the journal before it is written into the main filesystem. In data=ordered, which is the default, data blocks are forced out to the main filesystem before the metadata is committed into the journal. And finally, in data=writeback, there are no ordering guarantees with regard to data blocks at all. The last generally has the best performance, although it provides the fewest guarantees about ordering.
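To make the fsync() advice concrete: regardless of which of these modes is in effect, an application that wants "write a new file, then atomically replace the old one" to survive a crash has to issue the fsync() calls itself. The usual pattern looks something like the following rough, untested sketch (the function and parameter names here are made up for illustration, and error checking is mostly omitted):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Crash-safe replacement of "target": write a temporary file,
     * fsync() it, rename() it over the target, then fsync() the
     * containing directory so that the rename itself is committed
     * to disk as well. */
    int replace_file(const char *dir, const char *tmpname,
                     const char *target, const char *buf, size_t len)
    {
        int fd, dirfd;

        fd = open(tmpname, O_WRONLY | O_CREAT | O_TRUNC, 0666);
        if (fd < 0)
            return -1;
        write(fd, buf, len);   /* real code must check for short writes */
        fsync(fd);             /* force the new data blocks out to disk */
        close(fd);

        rename(tmpname, target);       /* atomically swap in the new name */

        dirfd = open(dir, O_RDONLY);   /* fsync() the directory so the    */
        fsync(dirfd);                  /* rename is committed to disk too */
        close(dirfd);
        return 0;
    }

The second fsync(), on the directory, forces out the directory change made by the rename(); whether it is strictly required depends on the filesystem, but it is the conservative thing to do when, as with a backup program, you really can't afford to lose data.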
Discussions about performance are quite tricky, because the answer depends on your workload and on what you're measuring. When you write the data into the journal, as in data=journal, the data has to end up getting written twice --- once to the journal, and once to the final location on disk. On the other hand, writing into the journal doesn't require any seeks (assuming the journal is allocated contiguously, as it would be on a newly created filesystem), and writes to the main filesystem can happen at the system's leisure, when it doesn't have higher-priority things to do. As a result, on workloads where the amount of data written to the filesystem is moderate, and where the application performs a lot of fsync() calls to guarantee data integrity, data=journal may actually perform very well. On the other hand, if the workload (or benchmark) attempts to use all of the disk's read/write bandwidth, then the double write implied by data=journal will be quite painful indeed.

In contrast, "data=ordered" delays journal commits until the data blocks can be written onto disk. This eliminates the double writes, but can cause the disk to seek much more heavily, since a journal commit now requires data blocks located all over the disk to be forced out before the journal commit record can be written. So depending on the benchmark or workload, this can cost you performance, especially if the application is calling fsync() a lot.

"data=writeback" is useful when trying to benchmark ext3 against other journaling filesystems, since it allows for an apples-to-apples comparison. This is because many journaling filesystems don't give any guarantees about data consistency. Why? Because, as noted above, giving such guarantees costs performance, and why lose performance if it isn't needed? If the application does need such guarantees, it can use fsync(), and then it pays the cost of fsync() only where it is needed, and not in other places.

In any case, I think you were much more concerned about data guarantees than about performance --- generally the right attitude. :-) In that case, I would suggest making sure that rdiff-backup uses fsync() wherever it is important to guarantee that the data has been flushed to disk, rather than relying on the filesystem to give you any consistency guarantees. This generally will give you the best combination of performance and data integrity.

I hope this helps!

						- Ted

_______________________________________________
Ext3-users@redhat.com
https://listman.redhat.com/mailman/listinfo/ext3-users