Re: [rfc] fsync_range?

Nick Piggin wrote:
> > > That's only in rare cases where writeout is started but not completed
> > > before we last dirty it and before we call the next fsync. I'd say in
> > > most cases, we won't have to wait (it should often remain clean).
>
> > There shouldn't be an extra wait. [in sync_file_range]
> 
> Of course there is because it has to wait on writeout of clean pages,
> then writeout dirty pages, then wait on writeout of dirty pages.

Eh?  How is that different from the "only in rare cases where writeout
is started but not completed" in your code?

Oh, let me guess.  sync_file_range() will wait for writeout to
complete on pages where the dirty bit was cleared when they were
queued for writeout and have not been dirtied since, while
fsync_range() will not wait for those?

I distinctly remember someone... yes, Andrew Morton, explaining why
the double wait is needed for integrity.

    http://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg272270.html

That's how I learned what (at least one person thinks) is the
intended semantics of sync_file_range().
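
For reference, this is the flag combination I understand to be the
intended data-integrity usage of the call (data pages only -- no
metadata, no barrier); the wrapper name here is just mine:

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* Push dirty pages in [off, off+len) to disk and wait for them,
     * also waiting for any writeout that was already in flight when
     * we were called.  Data pages only: no metadata, no block device
     * barrier. */
    static int sync_range_for_data(int fd, off64_t off, off64_t len)
    {
            return sync_file_range(fd, off, len,
                                   SYNC_FILE_RANGE_WAIT_BEFORE |
                                   SYNC_FILE_RANGE_WRITE |
                                   SYNC_FILE_RANGE_WAIT_AFTER);
    }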

I'll just quote one line from Andrew's post:
>> It's an interesting problem, with potentially high payback.

Back to that subtlety of waiting, and integrity.

If fsync_range does not wait at all on a page which is under writeout
and clean (not dirtied since the writeout was queued), it will not
achieve integrity.

That can happen due to the following sequence of events (see the
sketch after the list):

    1. App calls write(), dirties page.
    2. Background dirty flushing starts writeout, clears dirty bit.
    3. App calls fsync_range() on the page.
    4. fsync_range() doesn't wait on it because it's clean.
    5. Bang, app thinks the write is committed when it isn't.
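
In kernel terms, the requirement is just that the wait pass must cover
pages under writeback even when their dirty bit is already clear.  A
rough sketch (mine, not Nick's patch):

    /* Sketch only, not the real fsync_range().
     * filemap_write_and_wait_range() starts writeout for the dirty
     * pages in the range and then waits on every page still under
     * writeback there, including the page that went clean at step 2
     * above -- which is exactly the wait that closes the hole in
     * steps 4 and 5. */
    static int fsync_range_data_sketch(struct file *file,
                                       loff_t start, loff_t end)
    {
            int err;

            err = filemap_write_and_wait_range(file->f_mapping,
                                               start, end);
            if (err)
                    return err;

            /* A real implementation would go on to flush metadata and
             * issue the block device barrier, as fsync() does. */
            return 0;
    }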

On the other hand, if I've misunderstood and it will wait on that
page, but not twice, then I think it's the same as what
sync_file_range() is _supposed_ to do.

sync_file_range() is misunderstood, possibly due to the man page, the
hand-waving around it, and the implementation.

I don't think the flags mean "wait on all writeouts" _then_ "initiate
all dirty writeouts" _then_ "wait on all writeouts".

I think they mean do that *for each page in parallel*, or at least
come as close to that as the constraints allow.

In other words, no double-waiting or excessive serialisation.
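
Spelled out per page, I read the flags as something like the
following.  Every helper here is hypothetical, purely to show the
ordering; only the flag names in the comments are real:

    struct page;
    /* hypothetical helpers, just to spell out the ordering */
    extern int  page_dirty(struct page *page);
    extern int  page_under_writeout(struct page *page);
    extern void start_writeout(struct page *page);
    extern void wait_writeout(struct page *page);

    static void sync_one_page(struct page *page)
    {
            if (page_dirty(page) && page_under_writeout(page))
                    wait_writeout(page);  /* WAIT_BEFORE: must rewrite it */
            if (page_dirty(page))
                    start_writeout(page); /* WRITE */
            if (page_under_writeout(page))
                    wait_writeout(page);  /* WAIT_AFTER */
    }

A clean page under writeout gets one wait, not two; only a page
redirtied while under writeout needs both.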

Don't get me wrong, I think fsync_range() is a much cleaner idea, and
much more likely to be used.

If fsync_range() is coming, it wouldn't do any harm, imho, to delete
sync_file_range() completely, and replace it with a stub which calls
fsync_range().  Or ENOSYS, then we'll find out if anyone used it :-)
Your implementation will obviously be better, given all your kind
attention to fsync integrity generally.
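
By a stub, I mean something as simple as this; do_fsync_range() is a
made-up name for whatever fsync_range()'s internal entry point ends up
being:

    /* Sketch of the compatibility stub; do_fsync_range() is
     * hypothetical. */
    asmlinkage long sys_sync_file_range(int fd, loff_t offset,
                                        loff_t nbytes, unsigned int flags)
    {
            /* Ignore the writeout-level flags and give callers the
             * plain data-integrity semantics they almost certainly
             * wanted... */
            return do_fsync_range(fd, offset, nbytes);
            /* ...or return -ENOSYS and see who complains. */
    }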

Andrew Morton did write, though:
>>The reason for providing b) only (SYNC_FILE_RANGE_WRITE) is so that
>>userspace can get as much data into the queue as possible, to permit the
>>kernel to optimise IO scheduling better.

I wonder if there is something to that, or if it was just wishful
thinking.

-- Jamie

> 
>  
> > > one thing I dislike about it is that it exposes the new concept of
> > > "writeout" to the userspace ABI.  Previously all we cared about is
> > > whether something is safe on disk or not. So I think it is
> > > reasonable to augment the traditional data integrity APIs which will
> > > probably be more easily used by existing apps.
> > 
> > I agree entirely.
> > 
> > Everyone knows what fsync_range() does, just from the name.
> > 
> > Was there some reason, perhaps for performance or flexibility, for
> > exposing the "writeout" concept to userspace?
> 
> I don't think I ever saw actual numbers to justify it. The async
> writeout part of it I guess is one aspect, but one could just add
> an async flag to fsync (like msync) to get mostly the same result.
>  
> 
> > > > Also the kernel is in a better position to decide which order to do
> > > > everything in, and how best to batch it.
> > > 
> > > Better position than what? I proposed fsync_range (or fsyncv) to be
> > > in-kernel too, of course.
> > 
> > I mean the kernel is in a better position than userspace's lame
> > attempts to call sync_file_range() in a clever way for optimal
> > performance :-)
> 
> OK, agreed. In which case, fsyncv is a winner because you'd be able
> to sync multiple files and multiple ranges within each file.
> 
>  
> > > > Also, during the first wait (for in-progress writeout) the kernel
> > > > could skip ahead to queuing some of the other pages for writeout as
> > > > long as there is room in the request queue, and come back to the other
> > > > pages later.
> > > 
> > > Sure it could. That adds yet more complexity and opens the possibility of
> > > livelock (you go back to the page you were waiting for to find it was
> > > since redirtied and under writeout again).
> > 
> > Didn't you have a patch that fixed a similar livelock against other apps
> > in fsync()?
> 
> Well, that was more of "really slow progress". This could actually be
> a real livelock because progress may never be made.
> 
> 
> > > The problem is that it is hard to verify. Even if it is getting lots
> > > of testing, it is not getting enough testing with the block device
> > > being shut off or throwing errors at exactly the right time.
> > 
> > QEMU would be good for testing this sort of thing, but it doesn't
> > sound like an easy test to write.
> > 
> > > In 2.6.29 I just fixed a handful of data integrity and error reporting
> > > bugs in sync that have been there for basically all of 2.6.
> > 
> > Thank you so much!
> > 
> > When I started work on a database engine, I cared about storage
> > integrity a lot.  I looked into fsync integrity on Linux and came out
> > running because the smell was so bad.
> 
> I guess that abruptly shutting down the block device queue could be
> used to pick up some bugs. That could be done using a real host and brd
> quite easily.
> 
> The problem with some of those bugs I fixed is that some could take
> quite a rare and transient situation before the window even opens for
> possible data corruption. Then you have to crash the machine at that
> time, and hope the pattern that was written out is in fact one that
> will cause corruption.
> 
> I tried to write some debug infrastructure; basically putting sequence
> counts in the struct page and going bug if the page is found to be
> still dirty after the last fsync event but before the next dirty page
> event... that kind of handles the simple case of the pagecache, but
> not really the filesystem or block device parts of the equation, which
> seem to be more difficult.
> 
>  
> > > > Take a look at this, though:
> > > > 
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-04/msg00811.html
> > > > 
> > > > "The results show fadvise + sync_file_range is on par or better than
> > > > O_DIRECT. Detailed results are attached."
> > > 
> > > That's not to say fsync would be any worse. And it's just a microbenchmark
> > > anyway.
> > 
> > In the end he was using O_DIRECT synchronously.  You have to overlap
> > O_DIRECT with AIO (the only time AIO on Linux really works) to get
> > sensible performance.  So ignore that result.
> 
> Ah OK.
> 
> 
> > > > By the way, direct I/O is nice but (a) not always possible, and (b)
> > > > you don't get the integrity barriers, do you?
> > > 
> > > It should.
> > 
> > O_DIRECT can't do an I/O barrier after every write because performance
> > would suck.  Really badly.  However, a database engine with any
> > self-respect would want I/O barriers at certain points for data integrity.
> 
> Hmm, I don't follow why that should be the case. Doesn't any
> self-respecting storage controller tell us the data is safe when it
> hits its non-volatile RAM?
> 
>  
> > I suggest fdatasync() et al. should issue the barrier if there have
> > been any writes, including O_DIRECT writes, since the last barrier.
> > That could be a file-wide single flag "there have been writes since
> > last barrier".
> 
> Well, I'd say the less that simpler applications have to care about,
> the better. For Oracle and DB2 etc. I think we could have a mode that
> turns off intermediate block device barriers and give them a syscall
> or ioctl to issue the barrier manually. If that helps them significantly.
> 
> 
> > > > fsync_range would remove those reasons for using separate files,
> > > > making the database-in-a-single-file implementations more efficient.
> > > > That is administratively much nicer, imho.
> > > > 
> > > > Similar for userspace filesystem-in-a-file, which is basically the same.
> > > 
> > > Although I think a large part is IOPs rather than data throughput,
> > > so cost of fsync_range often might not be much better.
> > 
> > IOPs are affected by head seeking.  If the head is forced to seek
> > between journal area and main data on every serial transaction, IOPs
> > drops substantially.  fsync_range() would reduce that seeking, for
> > databases (and filesystems) which store both in the same file.
> 
> OK I see your point. But that's not to say you couldn't have two
> files or partitions laid out next to one another. But yes, no
> question that fsync_range is more flexible.
> 
> 
> > > What's the problem with fsyncv? The problem with your proposal is that
> > > it takes multiple syscalls and that it requires the kernel to build up
> > > state over syscalls which is nasty.
> > 
> > I guess I'm coming back to sync_file_range(), which sort of does that
> > separation :-)
> > 
> > Also, see the other mail, about the PostgreSQL folks wanting to sync
> > optimally multiple files at once, not serialised.
> > 
> > I don't have a problem with fsyncv() per se.  Should it take a single
> > file descriptor and list of file-ranges, or a list of file descriptors
> > with ranges?  The latter is more general, but too vectory without
> > justification is a good way to get syscalls NAKd by Linus.
> 
> The latter, I think. It is indeed much more useful (you could sync
> a hundred files and have them share a lot of the block device
> flushes / barriers).
>  
> 
> > In theory, pluggable Linux-AIO would be a great multiple-request
> > submission mechanism.  There's IOCB_CMD_FDSYNC (AIO request), just add
> > IOCB_CMD_FDSYNC_RANGE.  There's room under the hood of that API for
> > batching sensibly, and putting the waits and barriers in the best
> > places.  But Linux-AIO does not have a reputation for actually
> > working, though the API looks good in theory.
> 