Re: [TOPIC] Extending the filesystem crash recovery guaranties contract

Amir Goldstein <amir73il@xxxxxxxxx> · Thu, 2 May 2019 12:12:22 -0400

On Sat, Apr 27, 2019 at 5:00 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>
> Suggestion for another filesystems track topic.
>
> Some of you may remember the emotional(?) discussions that ensued
> when the crashmonkey developers embarked on a mission to document
> and verify filesystem crash recovery guaranties:
>
> https://lore.kernel.org/linux-fsdevel/CAOQ4uxj8YpYPPdEvAvKPKXO7wdBg6T1O3osd6fSPFKH9j=i2Yg@xxxxxxxxxxxxxx/
>
> There are two camps among filesystem developers and every camp
> has good arguments for wanting to document existing behavior and for
> not wanting to document anything beyond "use fsync if you want any guaranty".
>
> I would like to take a suggestion proposed by Jan on a related discussion:
> https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQx+TO3Dt7TA3ocXnNxbr3+oVyJLYUSpv4QCt_Texdvw@xxxxxxxxxxxxxx/
>
> and make a proposal that may be able to meet the concerns of
> both camps.
>
> The proposal is to add new APIs which communicate
> crash consistency requirements of the application to the filesystem.
>
> Example API could look like this:
> renameat2(..., RENAME_METADATA_BARRIER | RENAME_DATA_BARRIER)
> It's just an example. The API could take another form and may need
> more barrier types (I proposed to use new file_sync_range() flags).
>
> The idea is simple though.
> METADATA_BARRIER means all the inode metadata will be observed
> after crash if rename is observed after crash.
> DATA_BARRIER same for file data.
> We may also want an "ALL_METADATA_BARRIER" and/or
> "METADATA_DEPENDENCY_BARRIER" to more accurately
> describe what SOMC guaranties actually provide today.
>
> The implementation is also simple. Filesystems that currently
> have SOMC behavior don't need to do anything to respect
> METADATA_BARRIER and only need to call
> filemap_write_and_wait_range() to respect DATA_BARRIER.
> Filesystem developers are thus not tying their hands w.r.t. future
> performance optimizations for operations that are not explicitly
> requesting a barrier.
>

An update: Following the LSF session on $SUBJECT I had a discussion
with Ted, Jan and Chris.

We were all in agreement that linking an O_TMPFILE into the namespace
is probably already perceived by users as the barrier/atomic operation that
I am trying to describe.

So at least maintainers of btrfs/ext4/ext2 are sympathetic to the idea of
providing the required semantics when linking O_TMPFILE *as long* as
the semantics are properly documented.

This is what open(2) man page has to say right now:

 *  Creating a file that is initially invisible, which is then populated
    with data and adjusted to have appropriate filesystem attributes
    (fchown(2), fchmod(2), fsetxattr(2), etc.) before being atomically
    linked into the filesystem in a fully formed state (using linkat(2)
    as described above).

The phrase that I would like to add (probably in link(2) man page) is:
"The filesystem provides the guarantee that after a crash, if the linked
 O_TMPFILE is observed in the target directory, then all the data and
 metadata modifications made to the file before being linked are also
 observed."

For some filesystems, btrfs in particular, that would mean an implicit
fsync on the linked inode. On other filesystems, ext4/xfs in particular,
that would only require at least committing delayed allocations, but
will NOT require inode fsync nor journal commit/flushing disk caches.
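
To make the sequence concrete, here is a rough userspace sketch (not
taken from any patch) of the O_TMPFILE + linkat(2) pattern from the
man page. publish_tmpfile() is a hypothetical helper name. The explicit
fsync() before linkat() is what I assume a careful application has to
do today to get the ordering described above; with the proposed wording
the filesystem would provide that ordering on its own.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an unnamed file in 'dir', populate it,
 * then link it into the namespace at 'path' (same filesystem as 'dir').
 * Returns 0 on success, -1 with errno set on failure.
 */
int publish_tmpfile(const char *dir, const char *path,
                    const void *buf, size_t len)
{
    char procpath[64];
    ssize_t n;
    int saved_errno;
    int fd = open(dir, O_TMPFILE | O_WRONLY, 0600);

    if (fd < 0)
        return -1;

    /* Populate data and metadata while the file is still invisible. */
    n = write(fd, buf, len);
    if (n < 0)
        goto fail;
    if ((size_t)n != len) {
        errno = EIO;    /* treat a short write as an error in this sketch */
        goto fail;
    }
    if (fchmod(fd, 0644) < 0)
        goto fail;

    /*
     * Today's workaround: flush the inode before it becomes visible, so
     * that a crash cannot expose the name without the contents. Under
     * the proposed semantics this call would no longer be needed for
     * ordering (it still would be for durability).
     */
    if (fsync(fd) < 0)
        goto fail;

    /* Unprivileged way to link an O_TMPFILE fd, as shown in open(2). */
    snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
    if (linkat(AT_FDCWD, procpath, AT_FDCWD, path, AT_SYMLINK_FOLLOW) < 0)
        goto fail;

    return close(fd);

fail:
    saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return -1;
}
```

Note that this only orders the file contents before the link. Making
the new name itself durable would still take an fsync() on the
directory, which is outside what I am proposing here.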

I would like to hear the opinion of XFS developers and filesystem
maintainers who did not attend the LSF session.

I have no objection to adding an opt-in LINK_ATOMIC flag
and passing it down to filesystems instead of changing behavior and
patching stable kernels, but I prefer the latter.

I believe this should have been the semantics to begin with,
if for no other reason than that users would expect it regardless
of whatever we write in the manual page and no matter how many
!!!!!!!! we use for disclaimers.

And if we can all agree on that, then O_TMPFILE is quite young
in historical perspective, so it is not too late to call the
expectation gap a bug and fix it(?)

Taking this another step forward, if we agree on the language
I used above to describe the expected behavior, then we can
add an opt-in RENAME_ATOMIC flag to provide the same
semantics and document it in the same manner (this functionality
is needed for directories and non-regular files) and all that is left
is the fun part of choosing the flag name ;-)
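
For comparison, a minimal sketch of the replace-via-rename idiom as I
assume applications write it today for regular files. replace_file()
is a hypothetical helper name, and RENAME_ATOMIC does not exist yet,
so the code uses plain rename(2) with an explicit fsync() standing in
for the ordering that the flag would request.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: write the new contents to 'tmp_path', then
 * rename it over 'final_path'. Returns 0 on success, -1 with errno set.
 */
int replace_file(const char *tmp_path, const char *final_path,
                 const void *buf, size_t len)
{
    ssize_t n;
    int saved_errno;
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    n = write(fd, buf, len);
    if (n < 0)
        goto fail;
    if ((size_t)n != len) {
        errno = EIO;    /* treat a short write as an error in this sketch */
        goto fail;
    }

    /*
     * Stand-in for the proposed flag: make sure the data is on disk
     * before the rename can be observed after a crash.
     */
    if (fsync(fd) < 0)
        goto fail;
    if (close(fd) < 0) {
        saved_errno = errno;
        unlink(tmp_path);
        errno = saved_errno;
        return -1;
    }

    return rename(tmp_path, final_path);

fail:
    saved_errno = errno;
    close(fd);
    unlink(tmp_path);
    errno = saved_errno;
    return -1;
}
```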

Thanks,
Amir.