On Wed, Jan 21, 2009 at 05:24:01AM +0000, Jamie Lokier wrote:
> Nick Piggin wrote:
> > That's only in rare cases where writeout is started but not completed
> > before we last dirty it and before we call the next fsync. I'd say in
> > most cases, we won't have to wait (it should often remain clean).
>
> Agreed it's rare. In those cases, sync_file_range() doesn't wait
> twice either. Both functions are the same in this part.
>
> > > This is the reason sync_file_range() has all those flags. As I said,
> > > the man page doesn't really explain how to use it properly.
> >
> > Well, one can read what the code does. Aside from that extra wait,
>
> There shouldn't be an extra wait.

Of course there is, because it has to wait on writeout of clean pages,
then write out the dirty pages, then wait on writeout of the dirty pages.

> > one thing I dislike about it is that it exposes the new concept of
> > "writeout" to the userspace ABI. Previously all we cared about was
> > whether something is safe on disk or not. So I think it is
> > reasonable to augment the traditional data integrity APIs, which will
> > probably be more easily used by existing apps.
>
> I agree entirely.
>
> Everyone knows what fsync_range() does, just from the name.
>
> Was there some reason, perhaps for performance or flexibility, for
> exposing the "writeout" concept to userspace?

I don't think I ever saw actual numbers to justify it. The async
writeout part of it I guess is one aspect, but one could just add an
async flag to fsync (like msync) to get mostly the same result.

> > > Also the kernel is in a better position to decide which order to do
> > > everything in, and how best to batch it.
> >
> > Better position than what? I proposed fsync_range (or fsyncv) to be
> > in-kernel too, of course.
>
> I mean the kernel is in a better position than userspace's lame
> attempts to call sync_file_range() in a clever way for optimal
> performance :-)

OK, agreed. In which case, fsyncv is a winner, because you'd be able to
sync multiple files and multiple ranges within each file.

> > > Also, during the first wait (for in-progress writeout) the kernel
> > > could skip ahead to queuing some of the other pages for writeout as
> > > long as there is room in the request queue, and come back to the
> > > other pages later.
> >
> > Sure it could. That adds yet more complexity and opens the possibility
> > of livelock (you go back to the page you were waiting for to find it
> > was since redirtied and under writeout again).
>
> Didn't you have a patch that fixed a similar livelock against other
> apps in fsync()?

Well, that was more a case of "really slow progress". This could
actually be a real livelock, because progress may never be made.

> > The problem is that it is hard to verify. Even if it is getting lots
> > of testing, it is not getting enough testing with the block device
> > being shut off or throwing errors at exactly the right time.
>
> QEMU would be good for testing this sort of thing, but it doesn't
> sound like an easy test to write.
>
> > In 2.6.29 I just fixed a handful of data integrity and error reporting
> > bugs in sync that have been there for basically all of 2.6.
>
> Thank you so much!
>
> When I started work on a database engine, I cared about storage
> integrity a lot. I looked into fsync integrity on Linux and came out
> running because the smell was so bad.

I guess that abruptly shutting down the block device queue could be
used to pick up some bugs. That could be done quite easily using a real
host and brd.

The problem with some of those bugs I fixed is that they could require
quite a rare and transient situation before the window for possible
data corruption even opens. Then you have to crash the machine at that
time, and hope the pattern that was written out is in fact one that
will cause corruption.

I tried to write some debug infrastructure: basically putting sequence
counts in the struct page and going BUG if a page is found to be still
dirty after the last fsync event but before the next page dirtying
event... that kind of handles the simple case of the pagecache, but not
really the filesystem or block device parts of the equation, which seem
to be more difficult.
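Coming back to the flags discussion above, for reference this is roughly
what the "wait, write, wait" dance looks like from userspace with
sync_file_range() today (a rough, untested sketch; the helper name is
mine). Note that it only pushes pagecache data: it syncs no metadata and
issues no block device cache flush, which is a big part of why an
fsync_range()-style call is attractive.

/* Rough, untested sketch: "wait, write, wait" with sync_file_range().
 * glibc declares it in <fcntl.h> under _GNU_SOURCE. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <errno.h>

static int range_writeback(int fd, off64_t off, off64_t len)
{
	unsigned int flags =
		SYNC_FILE_RANGE_WAIT_BEFORE |	/* wait for writeout already in flight */
		SYNC_FILE_RANGE_WRITE |		/* start writeout of dirty pages in range */
		SYNC_FILE_RANGE_WAIT_AFTER;	/* wait for that writeout to finish */

	if (sync_file_range(fd, off, len, flags) < 0)
		return -errno;

	/* No metadata is synced and no block device cache flush /
	 * barrier is issued, so on its own this is not a data
	 * integrity operation -- hence the argument for fsync_range(). */
	return 0;
}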
> > > Take a look at this, though:
> > >
> > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-04/msg00811.html
> > >
> > > "The results show fadvise + sync_file_range is on par or better than
> > > O_DIRECT. Detailed results are attached."
> >
> > That's not to say fsync would be any worse. And it's just a
> > microbenchmark anyway.
>
> In the end he was using O_DIRECT synchronously. You have to overlap
> O_DIRECT with AIO (the only time AIO on Linux really works) to get
> sensible performance. So ignore that result.

Ah, OK.

> > > By the way, direct I/O is nice but (a) not always possible, and (b)
> > > you don't get the integrity barriers, do you?
> >
> > It should.
>
> O_DIRECT can't do an I/O barrier after every write because performance
> would suck. Really badly. However, a database engine with any
> self-respect would want I/O barriers at certain points for data
> integrity.

Hmm, I don't follow why that should be the case. Doesn't any
self-respecting storage controller tell us the data is safe when it
hits its non-volatile RAM?

> I suggest fdatasync() et al. should issue the barrier if there have
> been any writes, including O_DIRECT writes, since the last barrier.
> That could be a file-wide single flag: "there have been writes since
> the last barrier".

Well, I'd say the less that simpler applications have to care about,
the better. For Oracle, DB2, etc., I think we could have a mode that
turns off intermediate block device barriers and gives them a syscall
or ioctl to issue the barrier manually, if that helps them
significantly.

> > > fsync_range would remove those reasons for using separate files,
> > > making the database-in-a-single-file implementations more efficient.
> > > That is administratively much nicer, imho.
> > >
> > > Similar for userspace filesystem-in-a-file, which is basically the
> > > same.
> >
> > Although I think a large part is IOPs rather than data throughput,
> > so the cost of fsync_range often might not be much better.
>
> IOPs are affected by head seeking. If the head is forced to seek
> between the journal area and the main data on every serial transaction,
> IOPs drop substantially. fsync_range() would reduce that seeking, for
> databases (and filesystems) which store both in the same file.

OK, I see your point. But that's not to say you couldn't have two files
or partitions laid out next to one another. But yes, no question that
fsync_range is more flexible.
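As an aside, to make the fsyncv shape concrete before going further: the
struct, the fsyncv() prototype and the fallback helper below are
invented here purely for illustration (nothing like fsyncv() exists in
Linux). Today an application is reduced to per-file fdatasync(), fully
serialised.

/* Hypothetical shape for a vectored range sync.  fsyncv() does NOT
 * exist in Linux; the struct, prototype and helper below are invented
 * purely to illustrate the discussion. */
#include <sys/types.h>
#include <unistd.h>
#include <errno.h>

struct fsync_range_entry {
	int   fd;      /* file to sync */
	off_t offset;  /* start of range */
	off_t nbytes;  /* length of range (0 could mean "to EOF") */
};

/* The proposed call, roughly: sync all listed ranges, sharing block
 * device cache flushes / barriers between files where possible:
 *
 *	int fsyncv(const struct fsync_range_entry *vec, int count);
 */

/* What userspace has to do today: flush each file in turn with
 * fdatasync().  Each call waits for its own writeout and triggers its
 * own cache flush, fully serialised -- exactly the thing an in-kernel
 * fsyncv() could batch across files. */
static int fsyncv_fallback(const struct fsync_range_entry *vec, int count)
{
	int i;

	for (i = 0; i < count; i++) {
		/* fdatasync() also ignores the range: it flushes all of
		 * the file's dirty data, not just vec[i]. */
		if (fdatasync(vec[i].fd) < 0)
			return -errno;
	}
	return 0;
}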
> > What's the problem with fsyncv? The problem with your proposal is that
> > it takes multiple syscalls and that it requires the kernel to build up
> > state over syscalls which is nasty.
>
> I guess I'm coming back to sync_file_range(), which sort of does that
> separation :-)
>
> Also, see the other mail, about the PostgreSQL folks wanting to sync
> optimally multiple files at once, not serialised.
>
> I don't have a problem with fsyncv() per se. Should it take a single
> file descriptor and list of file-ranges, or a list of file descriptors
> with ranges? The latter is more general, but too vectory without
> justification is a good way to get syscalls NAKd by Linus.

The latter, I think. It is indeed much more useful (you could sync a
hundred files and have them share a lot of the block device flushes /
barriers).

> In theory, pluggable Linux-AIO would be a great multiple-request
> submission mechanism. There's IOCB_CMD_FDSYNC (AIO request), just add
> IOCB_CMD_FDSYNC_RANGE. There's room under the hood of that API for
> batching sensibly, and putting the waits and barriers in the best
> places. But Linux-AIO does not have a reputation for actually
> working, though the API looks good in theory.
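Since Linux-AIO came up: the existing opcode can at least be driven
today through the raw syscalls, roughly as below (an untested sketch;
the wrapper and helper names are mine, and IOCB_CMD_FDSYNC_RANGE does
not exist). The catch, as noted above, is that many kernels simply fail
the submit with -EINVAL because filesystems don't implement the AIO
fsync hook -- part of the "does not have a reputation for actually
working" problem.

/* Untested sketch: drive the existing IOCB_CMD_FDSYNC opcode through
 * the raw AIO syscalls.  Wrapper and helper names are made up here;
 * IOCB_CMD_FDSYNC_RANGE does not exist. */
#define _GNU_SOURCE
#include <linux/aio_abi.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <time.h>

static long sys_io_setup(unsigned nr, aio_context_t *ctx)
{
	return syscall(__NR_io_setup, nr, ctx);
}

static long sys_io_submit(aio_context_t ctx, long nr, struct iocb **iocbs)
{
	return syscall(__NR_io_submit, ctx, nr, iocbs);
}

static long sys_io_getevents(aio_context_t ctx, long min_nr, long nr,
			     struct io_event *events, struct timespec *ts)
{
	return syscall(__NR_io_getevents, ctx, min_nr, nr, events, ts);
}

/* Queue an fdatasync-style request for fd and wait for it to complete.
 * A real program would submit many iocbs (syncs, reads, writes) and
 * reap completions later; io_destroy() is omitted for brevity. */
static int aio_fdsync(int fd)
{
	aio_context_t ctx = 0;
	struct iocb cb;
	struct iocb *cbs[1] = { &cb };
	struct io_event ev;

	if (sys_io_setup(1, &ctx) < 0)
		return -errno;

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_lio_opcode = IOCB_CMD_FDSYNC;

	/* Many kernels fail here with -EINVAL: no aio fsync support
	 * in the filesystem. */
	if (sys_io_submit(ctx, 1, cbs) != 1)
		return -errno;

	if (sys_io_getevents(ctx, 1, 1, &ev, NULL) != 1)
		return -errno;

	return (int)ev.res;	/* 0 on success, negative errno on failure */
}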