Re: [PATCH] Documenting the crash-recovery guarantees of Linux file systems

Amir Goldstein <amir73il@xxxxxxxxx> · Wed, 6 Mar 2019 11:14:30 +0200

On Wed, Mar 6, 2019 at 4:59 AM Jayashree <jaya@xxxxxxxxxxxxx> wrote:
>
>  In this file, we document the crash-recovery guarantees
>  provided by four Linux file systems - xfs, ext4, F2FS and btrfs. We also
>  present Dave Chinner's proposal of Strictly-Ordered Metadata Consistency
>  (SOMC), which is provided by xfs. It is not clear to us if other file systems
>  provide SOMC

Nice work.
You may add
Reviewed-by: Amir Goldstein <amir73il@xxxxxxxxx>

A few nits below.

> ; we would be happy to modify the document if file-system
>  developers claim that their system provides (or aims to provide) SOMC.

This part belongs after the --- line
IOW, it does not belong in the commit message.

>
> Signed-off-by: Jayashree Mohan <jaya@xxxxxxxxxxxxx>
> ---
>  .../filesystems/crash-recovery-guarantees.txt      | 173 +++++++++++++++++++++
>  1 file changed, 173 insertions(+)
>  create mode 100644 Documentation/filesystems/crash-recovery-guarantees.txt
>
> diff --git a/Documentation/filesystems/crash-recovery-guarantees.txt b/Documentation/filesystems/crash-recovery-guarantees.txt
> new file mode 100644
> index 0000000..4d1a9c6b
> --- /dev/null
> +++ b/Documentation/filesystems/crash-recovery-guarantees.txt
> @@ -0,0 +1,173 @@
> +=====================================================================
> +File System Crash-Recovery Guarantees
> +=====================================================================
> +Linux file systems provide certain guarantees to user-space
> +applications about what happens to their data if the system crashes
> +(due to power loss or kernel panic). These are termed crash-recovery
> +guarantees.
> +
> +Crash-recovery guarantees only pertain to data or metadata that has
> +been explicitly persisted to storage with fsync(), fdatasync(), or
> +sync() system calls. By default, write(), mkdir(), and other
> +file-system related system calls only affect the in-memory state of
> +the file system.
> +
> +The crash-recovery guarantees provided by most Linux file systems are
> +significantly stronger than what is required by POSIX. POSIX is vague,
> +even allowing fsync() to do nothing (Mac OS X takes advantage of
> +this). However, the guarantees provided by file systems are not
> +documented, and vary between file systems. This document seeks to
> +describe the current crash-recovery guarantees provided by major Linux
> +file systems.
> +
> +What does the fsync() operation guarantee?
> +----------------------------------------------------
> +The fsync() operation is meant to force the physical write of data
> +corresponding to a file from the buffer cache, along with the file
> +metadata. Note that the guarantees mentioned for each file system below
> +are in addition to the ones provided by POSIX.
> +
> +POSIX
> +-----
> +fsync(file) : Flushes the data and metadata associated with the
> +file. However, if the directory entry for the file has not been
> +previously persisted, or has been modified, it is not guaranteed to be
> +persisted by the fsync of the file [1]. What this means is, if a file
> +is newly created, you will have to fsync(parent directory) in addition
> +to fsync(file) in order to ensure that the file data has safely
> +reached the disk.

No. In order to ensure that the file's *directory entry* will persist.
Throughout the doc, if you just say "file will persist", the meaning
is ambiguous. "File data will persist", "file metadata will persist",
and "file directory entry will persist" are three distinct
outcomes.
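
To make the distinction concrete, here is a rough Python sketch (not part
of the patch) of the POSIX-safe sequence for a newly created file: fsync
the file for its data and metadata, then fsync the parent directory for
its directory entry. The helper name create_durably is made up for
illustration.

```python
import os


def create_durably(dirpath, name, data):
    """Create dirpath/name holding `data` and persist all three things:
    file data, file metadata, and the directory entry.

    Hypothetical helper for illustration only.
    """
    path = os.path.join(dirpath, name)

    # Step 1: fsync(file) persists the file's data and metadata.
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
    try:
        offset = 0
        while offset < len(data):
            offset += os.write(fd, data[offset:])
        os.fsync(fd)
    finally:
        os.close(fd)

    # Step 2: fsync(parent directory) persists the new directory entry.
    # POSIX does not promise that step 1 covers this.
    dirfd = os.open(dirpath, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)

    return path
```

On file systems that persist the directory entry as part of fsync(file),
step 2 is redundant but harmless.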

> +
> +fsync(dir) : Flushes directory data and directory entries. However if
> +you created a new file within the directory and wrote data to the
> +file, then the file data is not guaranteed to be persisted, unless an
> +explicit fsync() is issued on the file.
> +
> +ext4
> +-----
> +fsync(file) : Ensures that a newly created file is persisted (no need

Ensures that the directory entry of a newly created file is persisted

> +to explicitly persist the parent directory). However, if you create
> +multiple names of the file (hard links), then they are not guaranteed
> +to persist unless each one of the hard links are persisted [2].

"...then the hard linked directory entries are not guaranteed to persist
unless each one of the parent directories is persisted."

> +
> +fsync(dir) : All file names within the persisted directory will exist,
> +but file data is not guaranteed to persist.
> +
> +btrfs
> +------
> +fsync(file) : Ensures that the newly created file is persisted, along
> +with all its hard links. You do not need to persist individual hard
> +links to the file.

Rephrase to disambiguate (file data vs. directory entries), as above.

> +
> +fsync(dir) : All the file names within the directory persist. All the
> +rename and unlink operations within the directory are persisted. Due
> +to the design choices made by btrfs, fsync of a directory could lead
> +to an iterative fsync on sub-directories, thereby requiring a full
> +file system commit. So btrfs does not advocate persisting directories
> +[2].
> +
> +fsync(symlink)
> +--------------
> +A symlink inode cannot be directly opened for IO, which means there is
> +no such thing as fsync of a symlink [3]. You could be tricked by the
> +fact that open and fsync of a symlink succeed without returning an
> +error, but what happens in reality is as follows.
> +
> +Suppose we have a symlink “foo”, which points to the file “A/bar”
> +
> +fd = open("foo", O_CREAT | O_RDWR)
> +fsync(fd)
> +
> +Both the above operations succeed, but if you crash after fsync, the
> +symlink could still be missing.
> +
> +When you try to open the symlink “foo”, you are actually trying to
> +open the file that the symlink resolves to, which in this case is
> +“A/bar”. When you fsync the inode returned by the open system call, you
> +are actually persisting the file “A/bar” and not the symlink. Note
> +that if the file “A/bar” does not exist and you try to open the
> +symlink “foo” without the O_CREAT flag, then file open will fail. To
> +obtain the file descriptor associated with the symlink inode, you
> +could open the symlink using “O_PATH | O_NOFOLLOW” flags. However, the
> +file descriptor obtained this way can be only used to indicate a
> +location in the file-system tree and to perform operations that act
> +purely at the file descriptor level. Operations like read(), write(),
> +fsync() etc cannot be performed on such file descriptors.
> +
> +Bottom line: You cannot fsync() a symlink.
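
FWIW, the behavior described in this section is easy to demonstrate. A
rough Python sketch follows, assuming Linux (os.O_PATH is Linux-only);
the function name and the file names inside workdir are made up for
illustration.

```python
import os


def fsync_symlink_demo(workdir):
    """Return (follows_target, errno_or_None) for a symlink in workdir.

    Hypothetical demo: "foo" is a symlink to the regular file "bar".
    """
    target = os.path.join(workdir, "bar")
    link = os.path.join(workdir, "foo")
    with open(target, "w") as f:
        f.write("hello")
    os.symlink(target, link)

    # A plain open() follows the symlink, so the fd refers to "bar".
    fd = os.open(link, os.O_RDWR)
    try:
        follows_target = os.fstat(fd).st_ino == os.stat(target).st_ino
        os.fsync(fd)  # persists "bar", not the symlink "foo"
    finally:
        os.close(fd)

    # O_PATH | O_NOFOLLOW gives an fd for the symlink itself, but such
    # an fd cannot be used for fsync().
    fd = os.open(link, os.O_PATH | os.O_NOFOLLOW)
    try:
        os.fsync(fd)
        err = None
    except OSError as e:
        err = e.errno
    finally:
        os.close(fd)

    return follows_target, err
```

On Linux I would expect this to return True together with errno.EBADF,
since an O_PATH descriptor is rejected by fsync().
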
> +
> +fsync(special files)
> +--------------------
> +Special files in Linux include block and character device files
> +(created using mknod), FIFO (created using mkfifo) etc. Just like the
> +behavior of fsync on symlinks described above, these special files do
> +not have an fsync function defined. Similar to symlinks, you
> +cannot fsync a special file [4].
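
The same kind of check works for a FIFO. This is only a sketch, assuming
Linux (opening a FIFO with O_RDWR is not portable); the function name is
made up for illustration.

```python
import os


def fsync_fifo_errno(workdir):
    """Try to fsync a FIFO; return the errno raised, or None on success.

    Hypothetical demo helper.
    """
    fifo = os.path.join(workdir, "fifo")
    os.mkfifo(fifo)

    # O_RDWR | O_NONBLOCK so that open() does not block waiting for a
    # peer on the other end of the FIFO.
    fd = os.open(fifo, os.O_RDWR | os.O_NONBLOCK)
    try:
        os.fsync(fd)
        return None
    except OSError as e:
        return e.errno
    finally:
        os.close(fd)
```

Unlike the symlink case, the open() here does return an fd for the
special file itself. It is the fsync() call that is expected to fail.
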
> +
> +
> +Strictly Ordered Metadata Consistency
> +-------------------------------------
> +File systems provide varying levels of persistence guarantees. A
> +consensus on these guarantees would let application developers rely
> +on a fixed set of assumptions about file-system behavior. Dave
> +Chinner proposed a unified model called Strictly Ordered
> +Metadata Consistency (SOMC) [5].
> +
> +Under this scheme, the file system guarantees to persist all previous
> +dependent modifications to the object upon fsync().  If you fsync() an
> +inode, it will persist all the changes required to reference the inode
> +and its data. SOMC can be defined as follows [6]:
> +
> +If op1 precedes op2 in program order (in-memory execution order), and
> +op1 and op2 share a dependency, then op2 must not be observed by a
> +user after recovery without also observing op1.
> +
> +Unfortunately, SOMC's definition depends upon whether two operations
> +share a dependency, which is file-system specific. A developer would
> +need to understand file-system internals to know if SOMC would order
> +one operation before another. It is worth noting that a file system
> +can be crash-consistent (according to POSIX), without providing SOMC
> +[7].
> +
> +Example
> +-------
> +touch A/foo
> +echo "hello" >  A/foo
> +sync
> +
> +mv A/foo A/bar
> +echo "world" > A/foo
> +fsync A/foo
> +CRASH
> +
> +What would you expect on recovery, if the file system crashed after
> +the final fsync returned successfully?
> +
> +Non-SOMC file systems will not persist the file
> +A/bar because it was not explicitly fsync-ed. This means that you will
> +find only the file A/foo with data “world” after a crash, thereby losing
> +the previously persisted file with data “hello” [8]. You will need to
> +explicitly persist the directory A to ensure the rename operation is
> +safely persisted on disk.
> +
> +Under SOMC, to correctly reference the new inode via A/foo,
> +the previous rename operation must persist as well. Therefore,
> +fsync() of A/foo will persist the renamed file A/bar as well.
> +On recovery you will find both A/bar (with data “hello”)
> +and A/foo (with data “world”).
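
Since "fsync A/foo" is not an actual shell command, it may help readers
to see the workload spelled out with real calls. A rough Python sketch,
with made-up helper names; the `portable` flag adds the explicit
fsync of directory A that a non-SOMC file system would need.

```python
import os


def write_and_fsync(path, data):
    """Create/truncate `path`, write `data`, and fsync the file."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
    try:
        offset = 0
        while offset < len(data):
            offset += os.write(fd, data[offset:])
        os.fsync(fd)
    finally:
        os.close(fd)


def fsync_dir(dirpath):
    """fsync a directory to persist its entries (creates, renames)."""
    fd = os.open(dirpath, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def run_workload(dir_a, portable=True):
    """Replay the example workload inside the existing directory dir_a."""
    foo = os.path.join(dir_a, "foo")
    bar = os.path.join(dir_a, "bar")

    # touch A/foo; echo hello > A/foo; sync
    write_and_fsync(foo, b"hello\n")
    os.sync()

    # mv A/foo A/bar
    os.rename(foo, bar)

    # echo world > A/foo; fsync A/foo
    write_and_fsync(foo, b"world\n")

    if portable:
        # Without SOMC, fsync(A/foo) alone need not persist the rename,
        # so persist directory A explicitly.
        fsync_dir(dir_a)
```

With portable=False this is exactly the sequence in the example, and
whether A/bar survives a crash depends on the file system.
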
> +
> +It is noteworthy that xfs, ext4, F2FS (when mounted with fsync_mode=strict)
> +and btrfs provide SOMC-like behaviour in this particular example.
> +However, xfs is the only one whose documentation claims to provide SOMC.
> +It is not clear if ext4, F2FS and btrfs provide strictly ordered
> +metadata consistency.
> +
> +--------------------------------------------------------
> +[1] http://man7.org/linux/man-pages/man2/fdatasync.2.html
> +[2] https://www.spinics.net/lists/linux-btrfs/msg77340.html
> +[3] https://www.spinics.net/lists/fstests/msg09370.html
> +[4] https://bugzilla.kernel.org/show_bug.cgi?id=202485
> +[5] https://marc.info/?l=fstests&m=155010885626284&w=2
> +[6] https://marc.info/?l=fstests&m=155011123126916&w=2
> +[7] https://www.spinics.net/lists/fstests/msg09379.html
> +[8] https://patchwork.kernel.org/patch/10132305/
> +
> --
> 2.7.4
>