Re: [rfc] fsync_range?

Nick Piggin <npiggin@xxxxxxx> · Wed, 21 Jan 2009 12:41:22 +0100

On Wed, Jan 21, 2009 at 11:18:02AM +0000, Jamie Lokier wrote:
> Nick Piggin wrote:
> > > > That's only in rare cases where writeout is started but not completed
> > > > before we last dirty it and before we call the next fsync. I'd say in
> > > > most cases, we won't have to wait (it should often remain clean).
> >
> > > There shouldn't be an extra wait. [in sync_file_range]
> > 
> > Of course there is becaues it has to wait on writeout of clean pages,
> > then writeout dirty pages, then wait on writeout of dirty pages.
> 
> Eh?  How is that different from the "only in rare cases where writeout
> is started but not completed" in your code?

No, in my code it is where writeout is started and the page has
been redirtied. If writeout has started and the page is still
clean (which should be the more common case of the two), then it
doesn't have to.

> Oh, let me guess.  sync_file_range() will wait for writeout to
> complete on pages where the dirty bit was cleared when they were
> queued for writout and have not been dirtied since, while
> fsync_range() will not wait for those?
> 
> I distinctly remember someone... yes, Andrew Morton, explaining why
> the double wait is needed for integrity.
> 
>     http://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg272270.html
> 
> That's how I learned what (at least one person thinks) is the
> intended semantics of sync_file_range().

The double wait is needed by sync_file_range, firstly because the
first flag is simply defined to wait for writeout, but also because
the "start writeout" flag is defined not to writeout on pages
which are dirty but already under writeout.

> I'll just quote one line from Andrew's post:
> >> It's an interesting problem, with potentially high payback.
> 
> Back to that subtlety of waiting, and integrity.
> 
> If fsync_range does not wait at all on a page which is under writeout
> and clean (not dirtied since the writeout was queued), it will not
> achieve integrity.
> 
> That can happen due to the following events:
> 
>     1. App calls write(), dirties page.
>     2. Background dirty flushing starts writeout, clears dirty bit.
>     3. App calls fsync_range() on the page.
>     4. fsync_range() doesn't wait on it because it's clean.
>     5. Bang, app things the write is committed when it isn't.

No, because fsync_range still has to wait for writeout pages *after*
it has submitted dirty pages for writeout. This includes all pages,
not just ones it has submitted just now.

> I don't think the flags mean "wait on all writeouts" _then_ "initiate
> all dirty writeouts" _then_ "wait on all writeouts".

They do. It is explicitly stated and that is exactly how it is
implemented (except "initiate writeout against all dirty pages" is
"initiate writeout against all dirty pages not already under writeout").

> I think they mean *for each page in parallel* do that, or at least do
> its best with those constraints.
> 
> In other words, no double-waiting or excessive serialisation.

Well you can do it for each page in parallel, yes. This is what we
discussed about starting writeout against *other* pages if we find
a page under writeout that we have to wait for. And then coming back
to that page to process it. This opens the whole livelock and
complexity thing.

> Don't get me wrong, I think fsync_range() is a much cleaner idea, and
> much more likely to be used.
> 
> If fsync_range() is coming, it wouldn't do any harm, imho, to delete
> sync_file_range() completely, and replace it with a stub which calls
> fsync_range().  Or ENOSYS, then we'll find out if anyone used it :-)
> Your implementation will obviously be better, since all your kind
> attention to fsync integrity generally.

Well, given that postgresql post that they need to sync multiple
files, I think fsyncv is a nice way forward. It can be used to
implement fsync_range too, which is slightly portable.

> Andrew Morton did write, though:
> >>The reason for providing b) only (SYNC_FILE_RANGE_WRITE) is so that
> >>userspace can get as much data into the queue as possible, to permit the
> >>kernel to optimise IO scheduling better.
> 
> I wonder if there is something to that, or if it was just wishful
> thinking.

Another problem with that, precisely because it is tied up with
idea of writeout mixed with data integrity semantics, is that
the kernel is *not* free to do what it thinks best. It has to
start writeout on all those pages and not return until writeout
is started.

If the queue fills up it has to block. It cannot schedule a thread
to write out asynchronously, etc. Because userspace is directing
how the implementation should work rather than the high level
intention.

Andrew and I have well archived difference of opinion on this ;) I
have no interest in ripping out sync_file_range. I can't say it is
wrong or always going to be suboptimal. But I think it is fine to
extend the more traditional fsync APIs too.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html