Re: [rfc] fsync_range?

Nick Piggin wrote:
> > I like the idea.  It's much easier to understand than sync_file_range,
> > whose man page doesn't really explain how to use it correctly.
> > 
> > But how is fsync_range different from the sync_file_range syscall with
> > all its flags set?
> 
> sync_file_range would have to wait, then write, then wait again. It
> also does not call into the filesystem's ->fsync function; I don't
> know what the wider consequences of that are for all filesystems,
> but for some it means that metadata required to read back the data
> is not synced properly, and often it means that a metadata sync will
> not work.

fsync_range() must also wait, write, then wait again.

The reason is this sequence of events:

    1. App calls write() on a page, dirtying it.
    2. Data writeout is initiated by usual kernel task.
    3. App calls write() on the page again, dirtying it again.
    4. App calls fsync_range() on the page.
    5. ... Dum de dum, time passes ...
    6. Writeout from step 2 completes.

    7. fsync_range() initiates another writeout, because the
       in-progress writeout from step 2 might not include the changes from
       step 3.

    8. fsync_range() waits for the writeout from step 7.
    9. fsync_range() requests a device cache flush if needed (we hope!).
   10. Returns to the app.

Therefore fsync_range() must wait for in-progress writeout to
complete, before initiating more writeout and waiting again.

This is the reason sync_file_range() has all those flags.  As I said,
the man page doesn't really explain how to use it properly.
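In terms of today's interface, that wait-write-wait sequence is what you
get by setting all three sync_file_range() flags at once. A minimal
sketch (Linux-specific; the function name is made up, and note it still
lacks the metadata commit and device cache flush discussed below):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>

/* Sketch only: the wait-write-wait sequence expressed with
 * sync_file_range().  This does NOT commit metadata or flush the
 * device cache, which is exactly the gap under discussion. */
static int fsync_range_sketch(int fd, off64_t offset, off64_t nbytes)
{
	return sync_file_range(fd, offset, nbytes,
			       SYNC_FILE_RANGE_WAIT_BEFORE | /* wait for writeout from step 2 */
			       SYNC_FILE_RANGE_WRITE |       /* start fresh writeout (step 7) */
			       SYNC_FILE_RANGE_WAIT_AFTER);  /* wait for it (step 8) */
}
```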

An optimisation would be to detect I/O that has been queued in the
elevator, but whose pages have not yet been read by the device
(i.e. no DMA or bounce-buffer copy has been done).  Most queued I/O
presumably falls into this category, and for it the second writeout
would not be required.

But perhaps this doesn't happen much in real life?

Also the kernel is in a better position to decide which order to do
everything in, and how best to batch it.

Also, during the first wait (for in-progress writeout) the kernel
could skip ahead and queue some of the other pages for writeout, as
long as there is room in the request queue, and come back to the rest
later.


> Filesystems could also much more easily get converted to a ->fsync_range
> function if that would be beneficial to any of them.
> 
> 
> > For database writes, you typically write a bunch of stuff in various
> > regions of a big file (or multiple files), then ideally fdatasync
> > some/all of the written ranges - with writes committed to disk in the
> > best order determined by the OS and I/O scheduler.
>  
> Do you know which databases do this? It will be nice to ask their
> input and see whether it helps them (I presume it is an OSS database
> because the "big" ones just use direct IO and manage their own
> buffers, right?)

I don't know if anyone uses sync_file_range(), or if it even works
reliably, since it's not going to get much testing.

I don't use it myself yet.  My interest is in developing (yet
another?)  high performance but reliable database engine, not an SQL
one though.  That's why I keep noticing the issues with fsync,
sync_file_range, barriers etc.

Take a look at this, though:

http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-04/msg00811.html

"The results show fadvise + sync_file_range is on par or better than
O_DIRECT. Detailed results are attached."

By the way, direct I/O is nice but (a) not always possible, and (b)
you don't get the integrity barriers, do you?

> Today, they will have to just fsync the whole file. So they first must
> identify which parts of the file need syncing, and then gather those
> parts as a vector.

Having to fsync the whole file is one reason some databases use
separate journal files: then fsync only flushes the journal file, not
the big data file, whose writeback can sometimes be more relaxed.

It's also a reason some databases recommend splitting the database
into multiple files of limited size - so the hit from fsync is reduced.

When a single file holds both journal and data (as with, say, an ext3
image in a file), every transaction (really each coalesced batch of
transactions) forces the disk head back and forth between the two
areas.  If the journal can be synced by itself, the head doesn't need
to move back and forth as much.

Identifying which parts to sync isn't much different from what a
modern filesystem has to do with its barriers, journals and journal
trees.  They have a lot in common.  This is bread-and-butter stuff
for database engines.

fsync_range would remove those reasons for using separate files,
making the database-in-a-single-file implementations more efficient.
That is administratively much nicer, imho.

Similar for userspace filesystem-in-a-file, which is basically the same.

> > For this, taking a vector of multiple ranges would be nice.
> > Alternatively, issuing parallel fsync_range calls from multiple
> > threads would approximate the same thing - if (big if) they aren't
> > serialised by the kernel.
> 
> I was thinking about doing something like that, but I just wanted to
> get basic fsync_range... OTOH, we could do an fsyncv syscall and glibc
> could implement fsync_range on top of that?

Rather than fsyncv, is there some way to separate the fsync into parts?

   1. A sequence of system calls to designate ranges.
   2. A call to say "commit and wait on all those ranges given in step 1".

It seems sync_file_range() isn't _that_ far off doing that, except it
doesn't get the metadata right, as you say, and it doesn't have a
place for the I/O barrier either.
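As an approximation of the two steps with today's flags: one pass to
designate each range and start its writeout, a second pass to wait on
them all.  A sketch (the struct and function names are made up, and
metadata and the barrier are still missing):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>

struct frange { off64_t off; off64_t len; };	/* hypothetical */

/* Pass 1 designates the ranges and starts writeout; pass 2 commits
 * and waits on all of them.  Starting all the writes before any of
 * the waits lets the elevator order the I/O. */
static int sync_ranges(int fd, const struct frange *r, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (sync_file_range(fd, r[i].off, r[i].len,
				    SYNC_FILE_RANGE_WRITE))
			return -1;
	for (i = 0; i < n; i++)
		if (sync_file_range(fd, r[i].off, r[i].len,
				    SYNC_FILE_RANGE_WAIT_BEFORE |
				    SYNC_FILE_RANGE_WRITE |
				    SYNC_FILE_RANGE_WAIT_AFTER))
			return -1;
	return 0;
}
```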

An additional couple of flags to sync_file_range() would sort out the
API:

   SYNC_FILE_RANGE_METADATA

      Commit the file metadata such as modification time and
      attributes.  Think fsync() versus fdatasync().

   SYNC_FILE_RANGE_IO_BARRIER

      Include a block device cache flush if needed, just as plain
      fsync() and fdatasync() are expected to.  Making it a flag gives
      the syscall the flexibility to skip the flush.

For the filesystem metadata, which you noticed is needed to access the
data on some filesystems, that should _always_ be committed.  Not
doing so is a bug in sync_file_range() to be fixed.

fdatasync() must commit the metadata needed to access the file data,
by the way.  In case it wasn't obvious. :-) This includes the file
size, if that's grown.  Many OSes have an O_DSYNC which is equivalent
to fdatasync() after each write, and is documented to write the inode
and other metadata needed to access flushed data if the file size has
increased.

With sync_file_range() fixed, all the other syscalls fsync(),
fdatasync() and fsync_range() could be implemented in terms of it -
possibly simplifying the code.  Maybe O_SYNC and O_DSYNC could use it
too.

-- Jamie
