Re: [rfc] fsync_range?

On Wed, Jan 21, 2009 at 03:15:00AM +0000, Jamie Lokier wrote:
> Nick Piggin wrote:
> > > I like the idea.  It's much easier to understand than sync_file_range,
> > > whose man page doesn't really explain how to use it correctly.
> > > 
> > > But how is fsync_range different from the sync_file_range syscall with
> > > all its flags set?
> > 
> > sync_file_range would have to wait, then write, then wait. It also
> > does not call into the filesystem's ->fsync function; I don't know
> > what the wider consequences of that are for all filesystems, but
> > for some it means that metadata required to read back the data is
> > not synced properly, and often it means that metadata sync will not
> > work.
> 
> fsync_range() must also wait, write, then wait again.
> 
> The reason is this sequence of events:
> 
>     1. App calls write() on a page, dirtying it.
>     2. Data writeout is initiated by usual kernel task.
>     3. App calls write() on the page again, dirtying it again.
>     4. App calls fsync_range() on the page.
>     5. ... Dum de dum, time passes ...
>     6. Writeout from step 2 completes.
> 
>     7. fsync_range() initiates another writeout, because the
>        in-progress writeout from step 2 might not include the changes from
>        step 3.
> 
>     8. fsync_range() waits for the writeout from step 7.
>     9. fsync_range() requests a device cache flush if needed (we hope!).
>    10. Returns to app.
> 
> Therefore fsync_range() must wait for in-progress writeout to
> complete, before initiating more writeout and waiting again.

That only matters in the rare case where writeout was started, but not
completed, before we last dirtied the page and before the next fsync is
called. In most cases we won't have to wait at all (the page should often
still be clean).

 
> This is the reason sync_file_range() has all those flags.  As I said,
> the man page doesn't really explain how to use it properly.

Well, one can read what the code does. Aside from that extra wait
and the problem of not syncing metadata, one thing I dislike about
it is that it exposes the new concept of "writeout" to the userspace
ABI.  Previously all we cared about was whether something is safe
on disk or not. So I think it is reasonable to augment the traditional
data integrity APIs, which will probably be more easily used by
existing apps.
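For what it's worth, the wait/write/wait sequence discussed above is what
you get by combining all three sync_file_range() flags. A minimal
userspace sketch (and only a sketch: it covers data pages only, with no
metadata sync and no device cache flush, which is exactly the problem):

    /*
     * Sketch: the wait -> write -> wait sequence expressed with
     * sync_file_range().  This flushes data pages only; it does not
     * sync the metadata needed to find the data, and does not force a
     * device cache flush.
     */
    #define _GNU_SOURCE
    #include <fcntl.h>

    static int range_datasync(int fd, off64_t offset, off64_t nbytes)
    {
            return sync_file_range(fd, offset, nbytes,
                                   SYNC_FILE_RANGE_WAIT_BEFORE |
                                   SYNC_FILE_RANGE_WRITE |
                                   SYNC_FILE_RANGE_WAIT_AFTER);
    }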
 

> An optimisation would be to detect I/O that's been queued on an
> elevator, but where the page has not actually been read (i.e. no DMA
> or bounce buffer copy done yet).  Most queued I/O presumably falls
> into this category, and the second writeout would not be required.
> 
> But perhaps this doesn't happen much in real life?

I doubt it would be worth the complexity. It would probably be a pretty
fiddly and ugly change to the pagecache.


> Also the kernel is in a better position to decide which order to do
> everything in, and how best to batch it.

Better position than what? I proposed fsync_range (or fsyncv) to be
in-kernel too, of course.
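Just to make the proposal concrete (the exact prototype is hypothetical,
nothing is merged at this point), what I have in mind is along the lines of:

    /*
     * Hypothetical prototype for the proposed syscall; the argument
     * list is still open for discussion.  Semantics: like
     * fdatasync()/fsync(), but restricted to [offset, offset + nbytes),
     * including whatever metadata is needed to read that range back
     * after a crash.
     */
    int fsync_range(int fd, off_t offset, off_t nbytes);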

 
> Also, during the first wait (for in-progress writeout) the kernel
> could skip ahead to queuing some of the other pages for writeout as
> long as there is room in the request queue, and come back to the other
> pages later.

Sure it could. That adds yet more complexity and opens the possibility of
livelock (you go back to the page you were waiting for, only to find it
has since been redirtied and is under writeout again).


> > > For database writes, you typically write a bunch of stuff in various
> > > regions of a big file (or multiple files), then ideally fdatasync
> > > some/all of the written ranges - with writes committed to disk in the
> > > best order determined by the OS and I/O scheduler.
> >  
> > Do you know which databases do this? It would be nice to ask for their
> > input and see whether it helps them (I presume it is an OSS database
> > because the "big" ones just use direct IO and manage their own
> > buffers, right?)
> 
> I don't know if anyone uses sync_file_range(), or if it even works
> reliably, since it's not going to get much testing.

The problem is that it is hard to verify. Even if it is getting lots
of testing, it is not getting enough testing with the block device
being shut off or throwing errors at exactly the right time.

In 2.6.29 I just fixed a handful of data integrity and error reporting
bugs in sync that have been there for basically all of 2.6.

 
> I don't use it myself yet.  My interest is in developing (yet
> another?)  high performance but reliable database engine, not an SQL
> one though.  That's why I keep noticing the issues with fsync,
> sync_file_range, barriers etc.
> 
> Take a look at this, though:
> 
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-04/msg00811.html
> 
> "The results show fadvise + sync_file_range is on par or better than
> O_DIRECT. Detailed results are attached."

That's not to say fsync would be any worse. And it's just a microbenchmark
anyway.

 
> By the way, direct I/O is nice but (a) not always possible, and (b)
> you don't get the integrity barriers, do you?

It should. But I wasn't advocating it over pagecache + syncing; I was
just wondering which databases could use fsyncv, so we can see whether
they can test it.

 
> > Today, they will have to just fsync the whole file. So they first must
> > identify which parts of the file need syncing, and then gather those
> > parts as a vector.
> 
> Having to fsync the whole file is one reason that some databases use
> separate journal files - so fsync only flushes the journal file, not
> the big data file which can sometimes be more relaxed.
> 
> It's also a reason some databases recommend splitting the database
> into multiple files of limited size - so the hit from fsync is reduced.
> 
> When a single file is used for journal and data (like
> e.g. ext3-in-a-file), every transaction (actually coalesced set of
> transactions) forces the disk head back and forth between two data
> areas.  If the journal can be synced by itself, the disk head doesn't
> need to move back and forth as much.
> 
> Identifying which parts to sync isn't much different than a modern
> filesystem needs to do with its barriers, journals and journal-trees.
> They have a lot in common.  This is bread and butter stuff for
> database engines.
> 
> fsync_range would remove those reasons for using separate files,
> making the database-in-a-single-file implementations more efficient.
> That is administratively much nicer, imho.
> 
> Similar for userspace filesystem-in-a-file, which is basically the same.

Although I think a large part of the cost is IOPS rather than data
throughput, so the cost of fsync_range often might not be much better.
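Still, to illustrate what a database-in-a-single-file could do with it
(everything below, including fsync_range() itself and the file layout,
is hypothetical):

    /*
     * Illustration only: a single file laid out as
     * [ journal region | data region ].  Committing a transaction only
     * needs the journal range durable; the data region's dirty pages
     * can be written back later at the kernel's convenience.
     * fsync_range() is the hypothetical interface sketched earlier.
     */
    static int commit_transaction(int fd, off_t journal_off, off_t journal_len)
    {
            /* With plain fsync(), the whole file would be flushed instead. */
            return fsync_range(fd, journal_off, journal_len);
    }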


> > > For this, taking a vector of multiple ranges would be nice.
> > > Alternatively, issuing parallel fsync_range calls from multiple
> > > threads would approximate the same thing - if (big if) they aren't
> > > serialised by the kernel.
> > 
> > I was thinking about doing something like that, but I just wanted to
> > get basic fsync_range... OTOH, we could do an fsyncv syscall and glibc
> > could implement fsync_range on top of that?
> 
> Rather than fsyncv, is there some way to separate the fsync into parts?
> 
>    1. A sequence of system calls to designate ranges.
>    2. A call to say "commit and wait on all those ranges given in step 1".

What's the problem with fsyncv? The problem with your proposal is that
it takes multiple syscalls, and that it requires the kernel to build up
state across syscalls, which is nasty.
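For concreteness, an fsyncv call could take a vector of ranges in one
shot, something like the following (the struct and prototype are made up
for illustration, not an existing API):

    /*
     * Hypothetical fsyncv interface: make several ranges of one file
     * durable in a single call, so the kernel can order and batch the
     * writeback.  This is just the shape implied by the discussion.
     */
    struct fsync_range_spec {
            off_t   offset;         /* start of the range to make durable */
            off_t   nbytes;         /* length of the range */
    };

    int fsyncv(int fd, const struct fsync_range_spec *ranges, int nranges);

    /*
     * fsync_range() could then be a libc wrapper around a one-entry
     * vector:
     *
     *      int fsync_range(int fd, off_t offset, off_t nbytes)
     *      {
     *              struct fsync_range_spec r = { offset, nbytes };
     *              return fsyncv(fd, &r, 1);
     *      }
     */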
