On Thu, Apr 11, 2019 at 1:16 PM Jan Kara <jack@xxxxxxx> wrote:
>
> On Wed 10-04-19 13:42:23, Amir Goldstein wrote:
> > On Wed, Apr 10, 2019 at 12:10 PM Jan Kara <jack@xxxxxxx> wrote:
> > >
> > > On Tue 09-04-19 21:01:54, Amir Goldstein wrote:
> > > > On Tue, Apr 9, 2019 at 6:43 PM Jan Kara <jack@xxxxxxx> wrote:
> > > > > On Tue 09-04-19 14:49:22, Amir Goldstein wrote:
> > > > > > Commit 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
> > > > > > writeback") claims that the sync_file_range(2) syscall was "created for
> > > > > > userspace to be able to issue background writeout and so waiting for
> > > > > > in-flight IO is undesirable there" and changes the writeback (back) to
> > > > > > WB_SYNC_NONE.
> > > > > >
> > > > > > This claim is only partially true. It is true for users that use the flag
> > > > > > SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was
> > > > > > the reason for changing to WB_SYNC_NONE writeback.
> > > > > >
> > > > > > However, that claim is not true for users that use the flag combination
> > > > > > SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|WAIT_AFTER}. Those users explicitly
> > > > > > requested to wait for in-flight IO as well as writeback of dirty
> > > > > > pages. The sync_file_range(2) man page describes this flag combination as
> > > > > > a "write-for-data-integrity operation", although some may argue against
> > > > > > this definition.
> > > > > >
> > > > > > Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT and use
> > > > > > the helper filemap_write_and_wait_range(), which uses WB_SYNC_ALL
> > > > > > writeback, to perform the range sync request.
> > > > > >
> > > > > > One intended use case for this API is creating a dependency between
> > > > > > a new file's content and its link into the namespace without requiring
> > > > > > a journal commit and flushing of disk volatile caches:
> > > > > >
> > > > > > fd = open(".", O_TMPFILE | O_RDWR, 0644);
> > > > > > write(fd, newdata, count);
> > > > > > sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE_AND_WAIT);
> > > > > > linkat(fd, "", AT_FDCWD, newfile, AT_EMPTY_PATH);
> > > > > >
> > > > > > For many local filesystems, ext4 and xfs included, the sequence above
> > > > > > will guarantee that after a crash, either 'newfile' exists with 'newdata'
> > > > > > content or 'newfile' does not exist. For some applications, this
> > > > > > guarantee is strong enough, and the effects of sync_file_range(2), even
> > > > > > with WB_SYNC_ALL, are far less intrusive to other writers in the system
> > > > > > than the effects of fdatasync(2).
> > > > >
> > > > > I agree that this paragraph is true, but I don't want any userspace program
> > > > > to rely on this. We've been through this when ext3 got converted to ext4,
> > > > > and it caused a lot of complaints. Ext3 had provided rather strong data vs
> > > > > metadata ordering due to the way journalling was implemented. Then ext4
> > > > > came, implemented delayed allocation and somewhat changed how journalling
> > > > > works, and suddenly userspace programmers were caught by surprise when code
> > > > > that had worked by luck stopped working. And they complained that since their
> > > > > code had worked without fsync(2) before, it should keep working as well. So it
> > > > > took a lot of explaining that their applications were broken, and Ted also
> > > > > implemented some workarounds so that at least the most common mistakes are
> > > > > silently fixed by the kernel most of the time.
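For concreteness, the quoted four-line sequence can be fleshed out into a runnable sketch. This is a sketch under assumptions, not part of the patch: it assumes Linux with O_TMPFILE support on the target filesystem, defines SYNC_FILE_RANGE_WRITE_AND_WAIT from the three existing flags in case the installed headers predate the patch, and uses the unprivileged /proc/self/fd linkat() idiom from the open(2) man page in place of a bare AT_EMPTY_PATH (which needs CAP_DAC_READ_SEARCH). The helper name create_linked() is illustrative.

```c
/* Sketch of the O_TMPFILE + sync_file_range + linkat sequence discussed
 * above.  Error handling is deliberately minimal. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#ifndef SYNC_FILE_RANGE_WRITE_AND_WAIT
/* Name introduced by the patch; defined here from the pre-existing flags. */
#define SYNC_FILE_RANGE_WRITE_AND_WAIT \
        (SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | \
         SYNC_FILE_RANGE_WAIT_AFTER)
#endif

/* Write 'count' bytes of 'data' into an unnamed file in 'dir', push the
 * data to disk, then link the file into the namespace as 'name'. */
int create_linked(const char *dir, const char *name,
                  const void *data, size_t count)
{
        char path[64];
        int fd = open(dir, O_TMPFILE | O_RDWR, 0644);

        if (fd < 0)
                return -1;
        if (write(fd, data, count) != (ssize_t)count)
                goto fail;
        /* Note the real argument order: fd, offset, nbytes, flags;
         * nbytes == 0 means "until end of file". */
        if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE_AND_WAIT) < 0)
                goto fail;
        /* Unprivileged alternative to linkat(fd, "", ..., AT_EMPTY_PATH). */
        snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
        if (linkat(AT_FDCWD, path, AT_FDCWD, name, AT_SYMLINK_FOLLOW) < 0)
                goto fail;
        close(fd);
        return 0;
fail:
        close(fd);
        return -1;
}
```

On filesystems without O_TMPFILE support the open() simply fails with EOPNOTSUPP, so a portable application would need a fallback path.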
> > > > >
> > > > > So I'm against providing 90% data integrity guarantees, because there'll be
> > > > > many people who'll think they can get away with it and then complain when
> > > > > they won't once our implementation changes.
> > > > >
> > > > > Rather, I'd modify the manpage to not say that SYNC_FILE_RANGE_WAIT_BEFORE
> > > > > | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER is a
> > > > > write-for-data-integrity operation.
> > > >
> > > > OK. I do agree that the manpage is misleading and that the language describing
> > > > this flag combination should be toned down. I do not object to adding more
> > > > and more disclaimers about this API not being a data integrity API.
> > >
> > > I don't think we need more disclaimers, I'd just reword that
> > > "write-for-data-integrity" bit.
> > >
> > > > But the requirement I have is a real-life workload, and fdatasync(2) is not at
> > > > all a viable option when the application is creating many files per second.
> > > >
> > > > I need to export the functionality of data writeback to userspace.
> > > > Objecting to exposing this functionality via the interface that has been
> > > > documented to expose it for years, and used to expose it in the
> > > > past (until the Fixes commit), is a bit weird IMO, even though I do
> > > > completely relate to your concern.
> > > >
> > > > I don't mind removing the "intended use case" part of the commit message
> > > > if that helps reduce the chance that people misunderstand
> > > > the guarantees of the API.
> > > >
> > > > Another thing I could do is introduce a new flag for sync_file_range()
> > > > that will really mean what I want it to mean (all data will be written back
> > > > and all related inode metadata changes will be committed to the journal
> > > > before the next inode metadata change is committed). For the sake of
> > > > argument, let's call it SYNC_FILE_RANGE_DATA_DEPENDENCY.
> > >
> > > So there are two parts to your requirements:
> > >
> > > 1) Data writeback for all file pages has been submitted (+ completed).
> > > 2) What happens with related metadata.
> > >
> > > sync_file_range() is fine for 1), although I agree that WB_SYNC_NONE mode
> > > can still end up skipping some pages due to unrelated writeback / lock
> > > contention, so having a sync_file_range() mode that does WB_SYNC_ALL
> > > writeback makes some sense to me.
> >
> > OK. Does it make enough sense for you to ACK my patch if the
> > commit message was toned down to:
> >
> > Commit 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
> > writeback") claims that the sync_file_range(2) syscall was "created for
> > userspace to be able to issue background writeout and so waiting for
> > in-flight IO is undesirable there" and changes the writeback (back) to
> > WB_SYNC_NONE.
> >
> > This claim is only partially true. It is true for users that use the flag
> > SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was
> > the reason for changing to WB_SYNC_NONE writeback.
> >
> > However, that claim is not true for users that use the flag combination
> > SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|WAIT_AFTER}.
> > Those users explicitly requested to wait for in-flight IO as well as
> > writeback of dirty pages.
> >
> > Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT
> > and use the helper filemap_write_and_wait_range(), which uses WB_SYNC_ALL
> > writeback, to perform the range sync request.
> >
> > Fixes: 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE")
> > Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx>
>
> Yes, I'd be fine with that.

Great. I'll send V2 with your ACK. I guess since Andrew merged the
Fixes commit, I'll point the patch at him as well.

> > > What happens with 2) is pretty much up to the filesystem. You are speaking
> > > about the journal, but that is very much a filesystem-specific detail.
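From the userspace side, part 1) boils down to a single call with the combined flags. A minimal sketch, again defining SYNC_FILE_RANGE_WRITE_AND_WAIT from the three pre-existing flags in case the installed headers predate the patch; the wrapper name sfr_write_and_wait() is illustrative:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef SYNC_FILE_RANGE_WRITE_AND_WAIT
#define SYNC_FILE_RANGE_WRITE_AND_WAIT \
        (SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | \
         SYNC_FILE_RANGE_WAIT_AFTER)
#endif

/* Wait for in-flight IO on [off, off+len), start writeback of dirty pages
 * in the range, then wait for that writeback to complete.  len == 0 means
 * "until end of file".  Returns 0 on success, -1 with errno on failure. */
int sfr_write_and_wait(int fd, off_t off, off_t len)
{
        return sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE_AND_WAIT);
}
```

Unlike fdatasync(2), this does not commit metadata or flush the disk's volatile write cache; it only pushes page-cache data down to the device.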
> > > What if the filesystem doesn't have a journal at all? E.g. ext2? We cannot
> > > really expose a generic API that silently fails to work based on fs
> > > internal details.
> > >
> > > I guess I fail to see what kind of guarantees from the fs you're looking
> > > for? When you speak about the journal, I guess you're looking for some kind
> > > of ordering, but it's hard to be sure about the exact requirements. So can
> > > you please spell those out? Once we know the requirements, we can speak
> > > about the API.
> >
> > The requirement is quite easy to explain with an example, less easy to
> > generalize. I illustrated the requirement with the example:
> > O_TMPFILE;write();DATA_BARRIER;linkat()
> >
> > The requirement of the DATA_BARRIER operation is that if the effects
> > of linkat() are observed after a crash, then the effects of write() must
> > also be observed after the crash.
>
> OK, understood, and I agree this makes sense to ask for... And you're not
> the only one. Application programmers often *assume* that if they do
> write(file_new), rename(file_new, file), they will see either the old or the
> new file contents after a crash. They are wrong, but ext4 has heuristics to
> detect such behavior and at least initiate writeback on rename, to make the
> time window in which data loss can happen smaller. There was a lot of
> controversy around this when users were transitioning from ext3 (where this
> worked) to ext4. If there were an API with decent performance to achieve
> this, I guess at least some applications would start using it.
>
> > At this point one may say that fdatasync(2) meets the requirements
> > of a DATA_BARRIER operation, so I need to introduce a non-requirement.
> > The non-requirement is that a DATA_BARRIER operation does not need
> > to be a DATA_INTEGRITY operation, meaning that after a crash, there
> > is no requirement to be able to find or access the written data.
>
> Well, I guess your other real requirement is good performance ;) How that's
> achieved is internal to the filesystem.
>
> > Now it should be clear to filesystem developers why the DATA_BARRIER
> > requirement is easier to meet, with far less performance penalty than
> > fdatasync(2), on xfs and ext4.
>
> Well, it won't be free either, but yes, you can save a transaction commit
> / disk cache flush compared to fdatasync(2).
>
> > Here is a clumsy attempt to generalize the guarantee of such an operation,
> > without referring to internal fs implementation details:
> > "Make sure that all file data hits persistent storage before the next
> > file metadata change hits persistent storage".
> >
> > Does that make sense to you?
> > Is the requirement clear?
> >
> > How does this sound for an API:
> > sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE_DATA_BARRIER);
> >
> > which individual filesystems could support, or return EOPNOTSUPP for,
> > via the ->fsync() file_operation?
>
> Yes, this sounds workable to me, but there are also other options - e.g.
> somehow pass the information to linkat() / rename() / whatever that the
> ordering is required. This might be easier for some filesystems to
> implement.

I like this idea. rename() and linkat() are definitely the most probable
metadata operations that need this barrier. But I wouldn't want to limit
this capability to rename() and linkat() alone. Overlayfs, for example,
could use a similar data write barrier before removing an xattr (when
"hydrating" a metacopy upper).

> So it might be good to make a short session at LSF/MM about this,
> so that developers of other filesystems can say what interface they would
> be able to implement...

Yeah, it's on my short list of things to discuss.

[CC: Anna & Josef]

Thanks,
Amir.
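For illustration, here is how an application might probe for the proposed flag. To be clear, SYNC_FILE_RANGE_WRITE_DATA_BARRIER is hypothetical and not part of any released kernel; the value 8 is an arbitrary placeholder beyond the three real flag bits, so a current kernel rejects it with EINVAL and the sketch falls back to the stronger, slower fdatasync(2).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef SYNC_FILE_RANGE_WRITE_DATA_BARRIER
#define SYNC_FILE_RANGE_WRITE_DATA_BARRIER 8    /* hypothetical bit */
#endif

/* Try the (hypothetical) cheap data barrier; fall back to fdatasync()
 * when the kernel (EINVAL) or the filesystem (EOPNOTSUPP) can't do it. */
int data_barrier(int fd)
{
        if (sync_file_range(fd, 0, 0,
                            SYNC_FILE_RANGE_WRITE_DATA_BARRIER) == 0)
                return 0;
        if (errno == EINVAL || errno == EOPNOTSUPP)
                return fdatasync(fd);
        return -1;
}
```

A caller would use it between write() and the dependent linkat()/rename(), accepting that on today's kernels every call takes the fdatasync() path.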