Re: application level write ordering guarantees?

On Sun, Dec 15, 2002 at 08:59:25PM -0800, Ben Escoto wrote:
> Hi, can someone tell me whether applications can expect the write
> requests they make to be executed in order?  For instance, suppose an
> application requests that a file be deleted, and then that another
> file be moved to an unrelated place.  Will these events always happen
> in that order?  Or to put it another way, if something unexpected
> happens in the meantime (say the computer crashes), is it guaranteed
> that just the second action won't have been performed (i.e. that the
> second action was done first and the crash happened just after that)?
> 
>     How about if a file is written (and closed) and then a different
> file is moved?  Is it possible that the second file gets moved before
> all the data is written?
> 
>     Does this depend on the file system or do all/most filesystems
> behave the same way?  Sorry if this is common knowledge, but I googled
> for a while and couldn't find anything.  If it matters, I am trying to
> make sure a backup program (see http://rdiff-backup.stanford.edu)
> doesn't lose data.  Thanks for any information.

In general, applications can't expect anything about data write
ordering unless they use fsync(), which guarantees that everything
written up to that point is flushed out to disk.  And with filesystems
that do not have journalling support, in general there are no
guarantees at all about whether file deletes, renames, etc. will be
committed to disk in any particular order, or whether they will
happen at all in the event of a crash.  Nor is there any guarantee
that data blocks will be written even if the filesystem metadata
changes are made (i.e., you can write a file, then rename it, and the
rename might take, but the data blocks might still not be written).
In general, if you want to guarantee that something is flushed out to
disk, use fsync().  And if you care about ordering, then the
application may need to do its own application-level journaling
(that's what most databases do, for example).
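
For instance, a minimal sketch of the write-then-flush pattern (the
function name and arguments here are just for illustration):

    #include <fcntl.h>
    #include <unistd.h>

    /* None of the data written here is guaranteed to be on disk
     * until the fsync() returns successfully. */
    int write_and_flush(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }

Note that fsync() only covers the file's own data and metadata; if
the file is newly created, the directory entry pointing at it may
need a separate fsync() on the directory (more on that below).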

Some filesystems will give you better guarantees.  For example,
filesystems that provide journalling will generally guarantee that
metadata operations will be committed to the filesystem in order.
However, many journaled filesystems will not guarantee anything about
the data blocks; the purpose of the journal in many cases is simply to
avoid long fsck runs in case of a crash, not to ensure application
data integrity --- that's the job of the application and fsync().

With ext3, this can be controlled using the mount options
"data=journal", "data=ordered", and "data=writeback".  In
data=journal, all data is committed into the journal before it is
written into the main filesystem.  In data=ordered, which is the
default, data blocks are forced out to the main filesystem before the
metadata is committed into the journal.  And finally, in
"data=writeback", there are no ordering guarantees with regard to
data blocks.  The last generally has the best performance, but it
also makes the weakest guarantees about ordering.
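
These modes are normally selected in /etc/fstab or on the mount
command line; for completeness, here is a sketch of doing the same
thing programmatically with mount(2) (the device and mount point are
made up):

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* The last argument is the ext3-specific option string. */
        if (mount("/dev/hda2", "/mnt", "ext3", 0, "data=ordered") < 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }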

Discussions about performance are quite tricky, because the answer
depends on your workload, and on what you're measuring.  When you
write the data into the journal, as in data=journal, data has to end
up getting written twice --- once to the journal, and once to the
final location on disk.  On the other hand, writing into the journal
doesn't require any seeks (assuming the journal is allocated
contiguously, as it would be on a newly created filesystem), and
writes to the filesystem can happen at the system's leisure, when it
doesn't have higher priority things to do.  As a result, on workloads
where the amount of writes into the filesystem is moderate, and where
there is a lot of fsync() performed by the application to guarantee
data integrity, data=journal may actually perform very well.  On the
other hand, if the workload (or benchmark) attempts to use all of the
disk's read/write bandwidth, then the double write implied by
data=journal will be quite painful indeed.

In contrast, "data=ordered" will delay journal commits until the data
blocks can be written onto disk.  This eliminates the double writes,
but can cause the disk to seek much more heavily, since a journal
commit now requires data blocks located all over the disk to be
forced out before the journal commit record can be written.  So
depending on the benchmark or workload, this can cost you
performance, especially if the application is calling fsync() a lot.

"data=writeback" is useful when trying to benchmark ext3 versus other
journaling filesystems, since it allows for an apples-to-apples
comparison.  This is because many journaling filesystems don't give
any guarantees about data consistency.  Why?  Because as noted above,
giving such guarantees costs performance, and why lose performance if
it's not needed?  If the application does need such guarantees, it can
use fsync(), and then it can pay the cost of fsync() only where it is
needed, and not in other places.

In any case, I think you were much more concerned about data
guarantees rather than performance --- generally the right attitude.
:-) In that case, I would suggest making sure that rdiff-backup uses
fsync() where it is important to guarantee that the data has been
flushed to disk, and not relying on the filesystem to give you any
consistency guarantees.  This generally will give you the best
combination of performance and data integrity.
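
The usual belt-and-suspenders sequence for that is: write to a
temporary file, fsync() it, rename() it into place, and then fsync()
the containing directory so the rename itself survives a crash.  A
sketch, with made-up filenames and minimal error handling:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("backup.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }

        const char data[] = "backup contents\n";
        if (write(fd, data, sizeof(data) - 1) < 0) { perror("write"); return 1; }

        /* Force the file's data blocks to disk before the rename. */
        if (fsync(fd) < 0) { perror("fsync"); return 1; }
        close(fd);

        if (rename("backup.tmp", "backup") < 0) { perror("rename"); return 1; }

        /* fsync() the directory so the rename itself is durable;
         * otherwise the directory update could be lost in a crash. */
        int dirfd = open(".", O_RDONLY);
        if (dirfd >= 0) {
            fsync(dirfd);
            close(dirfd);
        }
        return 0;
    }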

I hope this helps!

					- Ted



_______________________________________________

Ext3-users@redhat.com
https://listman.redhat.com/mailman/listinfo/ext3-users
