Re: Documenting the crash consistency guarantees of file systems

On Wed, Feb 13, 2019 at 12:35:16PM -0600, Vijay Chidambaram wrote:
> On Wed, Feb 13, 2019 at 12:22 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> > On Wed, Feb 13, 2019 at 7:06 PM Jayashree Mohan <jaya@xxxxxxxxxxxxx> wrote:
> AFAIK, any file system which persists things out of order to increase
> performance does not provide strictly ordered metadata semantics.

Define "things", please.

And while you are there, define "persist", please.

XFS can "persist" "things" out of order because it has a multi-phase
checkpointing subsystem and we can't control IO ordering during
concurrent checkpoint writeout. The out of order checkpoints don't
get put back in order until recovery is run - it reorders everything
that is in the journal into correct sequence order before it starts
recovery.

IOWs, the assertion that we must "persist" things in strict order to
maintain ordered metadata semantics is incorrect. /Replay/ of the
changes being recovered must be done in order, but that does not
require them to be persisted to stable storage in strict order.

> These semantics seem to indicate a total ordering among all
> operations, and an fsync should persist all previous operations (as
> ext3 used to do).

No, absolutely not.

You're talking about /globally ordered metadata/. This was the
Achilles' heel of ext3, resulting in fsync being indistinguishable
from sync and hence being horrifically slow. This caused a
generation of Linux application developers to avoid using fsync,
and it caused users (and fs developers who got blamed for losing
data) endless amounts of pain when files went missing after
crashes. Hindsight teaches us that ext3's behaviour was a horrible
mistake and not one we want to repeat.

Strictly ordered metadata only requires persisting all the previous
dependent modifications to the objects we need to persist.
i.e. if you fsync an inode we just allocated, then we also have to
persist the changes in the same transaction and then all the
previous changes that are dependent on that set of objects, and so
on all the way back to objects being clean on disk. If we crash,
we can then rebuild all of the information the user persisted
correctly.

There is no requirement for any other newly created inode elsewhere
in the filesystem to be present after fsync of the first file.
Independent file creation will only be persisted by the fsync if
there is a shared object modification dependency between those two
files elsewhere in metadata. e.g. they are both in the same inode
cluster so share an inode btree block modification that marks them
as used, hence if one is persisted, the btree block is persisted,
and hence the other inode and all its dependencies need to be
persisted as well.

That dependency tree is the "strict ordering" we talk about. At
times it can look like "globally ordered metadata", but for
independent changes it will only result in the small number of
dependencies being persisted and not the entire filesystem (as per
ext3).

> Note that Jayashree and I aren't arguing file systems should provide
> this semantics, merely that ext4 and btrfs violate it at certain
> points.

As does XFS.

http://xfs.org/index.php/XFS_FAQ#Q:_Why_do_I_see_binary_NULLS_in_some_files_after_recovery_when_I_unplugged_the_power.3F

ext4 inherited all its delalloc vs metadata ordering issues from
XFS, as ext4 really just copied the XFS design without understanding
all the problems it had. Then, when users started reporting the
problems we'd already fixed in XFS, they copied a number of the
mitigations from XFS as well....


-- 
Dave Chinner
david@xxxxxxxxxxxxx