Thank you for driving this discussion Amir. I'm glad ext4 and btrfs
developers want to provide these semantics.

If I'm understanding this correctly, the new semantics will be: any
data changes to files written with O_TMPFILE will be visible if the
associated metadata is also visible. Basically, there will be a
barrier between O_TMPFILE data and O_TMPFILE metadata.

The expectation is that applications will use this, and then rename
the O_TMPFILE file over the original file. Is this correct? If so, is
there also an implied barrier between O_TMPFILE metadata and the
rename?

Where does this land us on the discussion about documenting
file-system crash-recovery guarantees? Has that been deemed not
necessary?

Thanks,
Vijay Chidambaram
http://www.cs.utexas.edu/~vijay/

On Thu, May 2, 2019 at 11:12 AM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>
> On Sat, Apr 27, 2019 at 5:00 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> >
> > Suggestion for another filesystems track topic.
> >
> > Some of you may remember the emotional(?) discussions that ensued
> > when the crashmonkey developers embarked on a mission to document
> > and verify filesystem crash recovery guaranties:
> >
> > https://lore.kernel.org/linux-fsdevel/CAOQ4uxj8YpYPPdEvAvKPKXO7wdBg6T1O3osd6fSPFKH9j=i2Yg@xxxxxxxxxxxxxx/
> >
> > There are two camps among filesystem developers and every camp
> > has good arguments for wanting to document existing behavior and for
> > not wanting to document anything beyond "use fsync if you want any guaranty".
> >
> > I would like to take a suggestion proposed by Jan on a related discussion:
> > https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQx+TO3Dt7TA3ocXnNxbr3+oVyJLYUSpv4QCt_Texdvw@xxxxxxxxxxxxxx/
> >
> > and make a proposal that may be able to meet the concerns of
> > both camps.
> >
> > The proposal is to add new APIs which communicate
> > crash consistency requirements of the application to the filesystem.
> >
> > Example API could look like this:
> > renameat2(..., RENAME_METADATA_BARRIER | RENAME_DATA_BARRIER)
> > It's just an example. The API could take another form and may need
> > more barrier types (I proposed to use new file_sync_range() flags).
> >
> > The idea is simple though.
> > METADATA_BARRIER means all the inode metadata will be observed
> > after crash if rename is observed after crash.
> > DATA_BARRIER same for file data.
> > We may also want an "ALL_METADATA_BARRIER" and/or
> > "METADATA_DEPENDENCY_BARRIER" to more accurately
> > describe what SOMC guaranties actually provide today.
> >
> > The implementation is also simple. Filesystems that currently
> > have SOMC behavior don't need to do anything to respect
> > METADATA_BARRIER and only need to call
> > filemap_write_and_wait_range() to respect DATA_BARRIER.
> > Filesystem developers are thus not tying their hands w.r.t. future
> > performance optimizations for operations that are not explicitly
> > requesting a barrier.
> >
>
> An update: Following the LSF session on $SUBJECT I had a discussion
> with Ted, Jan and Chris.
>
> We were all in agreement that linking an O_TMPFILE into the namespace
> is probably already perceived by users as the barrier/atomic operation that
> I am trying to describe.
>
> So at least maintainers of btrfs/ext4/ext2 are sympathetic to the idea of
> providing the required semantics when linking O_TMPFILE *as long* as
> the semantics are properly documented.
>
> This is what open(2) man page has to say right now:
>
>  *  Creating a file that is initially invisible, which is then
>     populated with data and adjusted to have appropriate filesystem
>     attributes (fchown(2), fchmod(2), fsetxattr(2), etc.) before being
>     atomically linked into the filesystem in a fully formed state
>     (using linkat(2) as described above).
>
> The phrase that I would like to add (probably in link(2) man page) is:
> "The filesystem provides the guaranty that after a crash, if the linked
>  O_TMPFILE is observed in the target directory, then all the data and
>  metadata modifications made to the file before being linked are also
>  observed."
>
> For some filesystems, btrfs in particular, that would mean an implicit
> fsync on the linked inode. On other filesystems, ext4/xfs in particular,
> that would only require at least committing delayed allocations, but
> will NOT require inode fsync nor journal commit/flushing disk caches.
>
> I would like to hear the opinion of XFS developers and filesystem
> maintainers who did not attend the LSF session.
>
> I have no objection to adding an opt-in LINK_ATOMIC flag
> and passing it down to filesystems instead of changing behavior and
> patching stable kernels, but I prefer the latter.
>
> I believe this should have been the semantics to begin with,
> if for no other reason than that users would expect it regardless
> of whatever we write in the manual page and no matter how many
> !!!!!!!! we use for disclaimers.
>
> And if we can all agree on that, then O_TMPFILE is quite young
> in historic perspective, so not too late to call the expectation gap
> a bug and fix it.(?)
>
> Taking this another step forward, if we agree on the language
> I used above to describe the expected behavior, then we can
> add an opt-in RENAME_ATOMIC flag to provide the same
> semantics and document it in the same manner (this functionality
> is needed for directories and non regular files) and all there is left
> is the fun part of choosing the flag name ;-)
>
> Thanks,
> Amir.
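
For readers following along, the userspace pattern that the open(2)
excerpt above describes might look roughly like the sketch below. It is
only an illustration under stated assumptions: the helper name
publish_via_tmpfile() and its signature are made up for this example,
it assumes Linux with /proc mounted and a filesystem that supports
O_TMPFILE, and the explicit fsync() reflects the conservative "use
fsync if you want any guaranty" position rather than the semantics
proposed in this thread. If the proposed wording were adopted, that
fsync() would presumably no longer be needed to get data visibility
whenever the link itself is observed after a crash.

```c
#define _GNU_SOURCE             /* needed for O_TMPFILE */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (name and signature are illustrative only):
 * write len bytes to an unnamed file in dir, then link it in as
 * dir/name. Returns 0 on success, -1 with errno set on failure. */
int publish_via_tmpfile(const char *dir, const char *name,
                        const void *data, size_t len)
{
    char proc_path[64], target[PATH_MAX];
    const char *p = data;
    size_t left = len;
    int fd, n, saved;

    assert(dir && name && data);

    /* Create an unnamed inode; it is invisible in the namespace. */
    fd = open(dir, O_TMPFILE | O_WRONLY, 0600);
    if (fd < 0)
        return -1;

    /* Populate the file, retrying on short writes and EINTR. */
    while (left > 0) {
        ssize_t w = write(fd, p, left);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        p += w;
        left -= (size_t)w;
    }

    /* Adjust attributes before the file becomes visible. */
    if (fchmod(fd, 0644) < 0)
        goto fail;

    /* Conservative choice today: persist data and inode metadata
     * explicitly before the name can be observed. */
    if (fsync(fd) < 0)
        goto fail;

    n = snprintf(target, sizeof(target), "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof(target)) {
        errno = ENAMETOOLONG;
        goto fail;
    }

    /* Link the open file into the namespace via /proc, as open(2)
     * documents. linkat() fails with EEXIST if target exists. */
    snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", fd);
    if (linkat(AT_FDCWD, proc_path, AT_FDCWD, target,
               AT_SYMLINK_FOLLOW) < 0)
        goto fail;

    return close(fd);

fail:
    saved = errno;
    close(fd);
    errno = saved;
    return -1;
}
```

Note that linkat() cannot replace an existing name, so this sketch only
covers creating a new file. Replacing an existing file would still need
a rename, which is where the RENAME_ATOMIC idea above comes in.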