On Sat, Apr 27, 2019 at 5:00 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>
> Suggestion for another filesystems track topic.
>
> Some of you may remember the emotional(?) discussions that ensued
> when the crashmonkey developers embarked on a mission to document
> and verify filesystem crash recovery guaranties:
>
> https://lore.kernel.org/linux-fsdevel/CAOQ4uxj8YpYPPdEvAvKPKXO7wdBg6T1O3osd6fSPFKH9j=i2Yg@xxxxxxxxxxxxxx/
>
> There are two camps among filesystem developers and every camp
> has good arguments for wanting to document existing behavior and for
> not wanting to document anything beyond "use fsync if you want any guaranty".
>
> I would like to take a suggestion proposed by Jan on a related discussion:
> https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQx+TO3Dt7TA3ocXnNxbr3+oVyJLYUSpv4QCt_Texdvw@xxxxxxxxxxxxxx/
>
> and make a proposal that may be able to meet the concerns of
> both camps.
>
> The proposal is to add new APIs which communicate
> crash consistency requirements of the application to the filesystem.
>
> Example API could look like this:
> renameat2(..., RENAME_METADATA_BARRIER | RENAME_DATA_BARRIER)
> It's just an example. The API could take another form and may need
> more barrier types (I proposed to use new file_sync_range() flags).
>
> The idea is simple though.
> METADATA_BARRIER means all the inode metadata will be observed
> after crash if rename is observed after crash.
> DATA_BARRIER same for file data.
> We may also want a "ALL_METADATA_BARRIER" and/or
> "METADATA_DEPENDENCY_BARRIER" to more accurately
> describe what SOMC guaranties actually provide today.
>
> The implementation is also simple. filesystem that currently
> have SOMC behavior don't need to do anything to respect
> METADATA_BARRIER and only need to call
> filemap_write_and_wait_range() to respect DATA_BARRIER.
> filesystem developers are thus not tying their hands w.r.t future
> performance optimizations for operations that are not explicitly
> requesting a barrier.
>

An update:
Following the LSF session on $SUBJECT I had a discussion with
Ted, Jan and Chris.

We were all in agreement that linking an O_TMPFILE into the namespace
is probably already perceived by users as the barrier/atomic operation
that I am trying to describe.

So at least the maintainers of btrfs/ext4/ext2 are sympathetic to the idea
of providing the required semantics when linking O_TMPFILE *as long* as
the semantics are properly documented.

This is what the open(2) man page has to say right now:

 *  Creating a file that is initially invisible, which is then
    populated with data and adjusted to have appropriate
    filesystem attributes (fchown(2), fchmod(2), fsetxattr(2),
    etc.) before being atomically linked into the filesystem
    in a fully formed state (using linkat(2) as described above).

The phrase that I would like to add (probably in the link(2) man page) is:
"The filesystem provides the guarantee that after a crash, if the linked
 O_TMPFILE is observed in the target directory, then all the data and
 metadata modifications made to the file before being linked are also
 observed."

For some filesystems, btrfs in particular, that would mean an implicit
fsync on the linked inode. On other filesystems, ext4/xfs in particular,
that would require at a minimum committing delayed allocations, but
would NOT require an inode fsync, a journal commit or flushing disk caches.

I would like to hear the opinion of XFS developers and filesystem
maintainers who did not attend the LSF session.

I have no objection to adding an opt-in LINK_ATOMIC flag
and passing it down to filesystems instead of changing behavior and
patching stable kernels, but I prefer the latter.

I believe this should have been the semantics to begin with
if for no other reason, because users would expect it regardless
of whatever we write in the manual page and no matter how many
!!!!!!!! we use for disclaimers.

And if we can all agree on that, then O_TMPFILE is quite young in
historic perspective, so it is not too late to call the expectation gap
a bug and fix it.(?)

Taking this another step forward, if we agree on the language
I used above to describe the expected behavior, then we can
add an opt-in RENAME_ATOMIC flag to provide the same semantics
and document it in the same manner (this functionality is needed
for directories and non regular files) and all that is left is the
fun part of choosing the flag name ;-)

Thanks,
Amir.
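
P.S. For anyone who has not used the pattern, here is a minimal sketch
in C of the O_TMPFILE + linkat(2) sequence discussed above, as an
application has to write it today if it wants the "name observed implies
contents observed" property. publish_file() is a made-up helper name for
illustration only. The sketch assumes /proc is mounted and that the
filesystem holding the directory supports O_TMPFILE. The explicit fsync()
before linkat() is the call that, as I understand the proposal, would no
longer be needed for this ordering guarantee.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * publish_file() is a hypothetical helper, not an existing API.
 * It creates an unnamed file in 'dir', fills it with 'data', sets its
 * mode, and then links it into 'dir' under 'name'.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int publish_file(const char *dir, const char *name,
                        const void *data, size_t len, mode_t mode)
{
    char procpath[64];
    const char *p = data;
    int dirfd, fd, saved;

    assert(dir != NULL && name != NULL);

    dirfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;

    /* Unnamed inode on the filesystem that contains 'dir'. */
    fd = open(dir, O_TMPFILE | O_WRONLY, 0600);
    if (fd < 0)
        goto fail_dir;

    /* Populate the file while it is still invisible. */
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        p += n;
        len -= (size_t)n;
    }
    if (fchmod(fd, mode) < 0)
        goto fail;

    /*
     * Without a documented guarantee, a careful application calls
     * fsync() here so the contents reach stable storage before the
     * name can appear in the directory.
     */
    if (fsync(fd) < 0)
        goto fail;

    /* Give the file a name, via /proc/self/fd as shown in open(2). */
    snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
    if (linkat(AT_FDCWD, procpath, dirfd, name, AT_SYMLINK_FOLLOW) < 0)
        goto fail;

    close(fd);
    close(dirfd);
    return 0;

fail:
    saved = errno;
    close(fd);
    errno = saved;
fail_dir:
    saved = errno;
    close(dirfd);
    errno = saved;
    return -1;
}
```

A caller would use it as publish_file("/some/dir", "config", buf, len, 0644).
Note that the sketch says nothing about the durability of the new name
itself; it only orders the file contents before the link.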