On Mon, Aug 27, 2018 at 7:25 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Mon, Aug 27, 2018 at 12:55:36AM +0300, Amir Goldstein wrote:
> > On Sun, Aug 26, 2018 at 10:34 PM Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
> > >
> > > On Sun, Aug 26, 2018 at 6:25 PM, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> > > > For an overlayfs file/inode, page I/O is operating on the real underlying
> > > > file, so sync_file_range() should operate on the real underlying file
> > > > mapping to take effect.
> > >
> > > The man page tells us that this syscall basically gives no guarantees
> > > at all and shouldn't be used in portable programs.
> > >
> >
> > Oh no. You need to understand the context of this very bold warning.
> > The warning speaks at length about durability, and it rightfully states
> > that you have no way of knowing what data will persist after a crash.
> > This is relevant for application developers looking for durability, but
> > that is not the only use case for sync_file_range().
> >
> > I have an application using sync_file_range() for consistency, which is
> > not the same game as durability.
> >
> > They will tell you that the only safe way to guarantee consistency of
> > data in a new file is to do:
> >
> >     open(...O_TMPFILE) or open(TEMPFILE, ...)
> >     write()
> >     fsync()
> >     link() or rename()
> >
> > Then you don't know whether the file will exist after a crash, but if
> > it does exist, its content will be consistent.
> >
> > But the fact is that if you need to do many of those new-file writes,
> > many fsync() calls cost much more than the cost of syncing the inode
> > pages, because every new file writes metadata, and metadata forces
> > fsync() to flush the journal.
> >
> > Multiply that by the number of containers and you have every fsync() on
> > every file in every overlayfs container all slamming the underlying
> > filesystem's journal.
> >
> > The fsync() in the snippet above can be safely replaced with
> > sync_file_range(), eliminating all the cost of excessive journal
> > flushes without losing any consistency guarantee on "strictly ordered
> > metadata" filesystems - and all major filesystems today are.
>
> Wrong.
>
> Nice story, but wrong.
>
> sync_file_range does this:
>
>         if (flags & SYNC_FILE_RANGE_WRITE) {
>                 ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
>                                                  WB_SYNC_NONE);
>         ......
>
> Note the use of "WB_SYNC_NONE"?
>
> This writeback type provides no guarantee that the entire range is
> written back. Writeback can skip pages for any reason when it is set -
> to avoid blocking, lock contention, maybe complex allocation is
> required, etc. WB_SYNC_NONE doesn't even tag pages in
> write_cache_pages(), so there's no way to ensure no pages are missed
> or retried when set_page_writeback_keepwrite() is called due to
> partial page writeback requiring another writeback call to the page
> to finish writeback. It doesn't try to write back newly dirtied pages
> that are already under writeback. And so on.
>
> sync_file_range() provides *no guarantees* about getting your data
> to disk at all and /never has/.
>

Thanks for clarifying that!
I guess we'll need to go and re-fix the concurrent _xfs_log_force()
optimization ;-/

> > > So, I'd just let the non-functionality be for now. If someone
> > > complains of a regression (unlikely) we can look into it.
> >
> > I would like to place a complaint :-)
> >
> > I guess we could go for f_op->sync_ranges()?
>
> No. sync_file_range() needs to die.
>

I guess if we really wanted we could add a new FADV_WILLSYNC...
Anyway, I am withdrawing the complaint.
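
For the archive, here is a minimal userspace sketch of the
write-fsync-rename pattern discussed above, with the fsync() the
thread debated replacing. The helper name, paths, and error handling
are illustrative, not from any real codebase:

/*
 * Minimal sketch (illustrative names and paths) of the
 * write-fsync-rename consistency pattern from the thread.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int write_file_atomically(const char *dir, const char *name,
                                 const void *buf, size_t len)
{
        char tmp[4096], dst[4096];
        int fd;

        /* Write into a temporary name in the same directory. */
        snprintf(tmp, sizeof(tmp), "%s/.%s.tmp", dir, name);
        snprintf(dst, sizeof(dst), "%s/%s", dir, name);

        fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
                return -1;

        if (write(fd, buf, len) != (ssize_t)len)
                goto fail;

        /*
         * This fsync() is the expensive per-file journal flush the
         * thread discussed. Replacing it with
         *
         *     sync_file_range(fd, 0, 0,
         *                     SYNC_FILE_RANGE_WAIT_BEFORE |
         *                     SYNC_FILE_RANGE_WRITE |
         *                     SYNC_FILE_RANGE_WAIT_AFTER);
         *
         * is NOT safe, per Dave's explanation above: WB_SYNC_NONE
         * writeback may skip pages, so the range is not guaranteed
         * to reach disk.
         */
        if (fsync(fd) < 0)
                goto fail;

        if (close(fd) < 0) {
                unlink(tmp);
                return -1;
        }

        /* Publish the file under its final name. */
        return rename(tmp, dst);

fail:
        close(fd);
        unlink(tmp);
        return -1;
}

The rename() at the end is what publishes the file: after a crash the
final name either does not exist or, thanks to the preceding fsync(),
refers to complete content.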
Thanks, Amir.