Re: [rfc] fsync_range?

Bryan Henderson wrote:
> Jamie Lokier <jamie@xxxxxxxxxxxxx> wrote on 01/21/2009 01:08:55 PM:
> 
> > For better or worse, I/O barriers and I/O flushes are the same thing
> > in the Linux block layer.  I've argued for treating them distinctly,
> > because there are different I/O scheduling opportunities around each
> > of them, but there wasn't much interest.
> 
> It's hard to see how they could be combined -- flushing (waiting for the 
> queue of writes to drain) is what you do -- at great performance cost -- 
> when you don't have barriers available.  The point of a barrier is to 
> avoid having the queue run dry.

Linux has a combined flush+barrier primitive in the block layer.
Actually it's not a primitive op, it's a flag on a write meaning "do
flush+barrier before and after this write", but that dates from fs
transaction commits, and isn't appropriate for fsync.

> Yes, it's the old performance vs integrity issue.  Drives long ago came 
> out with features to defeat operating system integrity efforts, in 
> exchange for performance, by doing write caching by default, ignoring 
> explicit demands to write through, etc.  Obviously, some people want that, 
> but I _have_ seen Linux developers escalate the battle for control of the 
> disk drive.  I can just never remember where it stands at any moment.

Last time I read about it, a few drives did that for a little while,
then they stopped doing it; such drives are rare now, if they exist at
all.

Forget about "Linux battling for control".  Windows does this barrier
stuff too, as does every other major OS, and Microsoft documents it in
some depth.

Upmarket systems, of course, use battery-backed disk controllers to
get speed and integrity together, or increasingly SSDs.

Certain downmarket (= cheapo) systems benefit noticeably from the right
barriers.  Pull the power on a cheap Linux-based media player with a
hard disk inside, and if it's using ext3 with barriers off, expect
filesystem corruption from time to time.  I and others working on such
things have seen it; with barriers on, we never see any corruption.
This is with the cheapest small consumer disks you can find.

> But it doesn't matter in this discussion because my point is that if you 
> accept the performance hit for integrity (I suppose we're saying that in 
> current Linux, in some configurations, if a process does frequent fsyncs 
> of a file, every process writing to every drive that file touches will 
> slow to write-through speed), it will be about the same with 100 
> fsync_ranges in quick succession as for 1.

Write-through speed depends _heavily_ on head seeking with a
rotational disk.

100 fsync_ranges _for one committed app-level transaction_ is different
from a succession of 100 transactions to commit.  If an app requires
one transaction which happens to modify 100 different places in a
database file, you want those written in the best head seeking order.
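
To make that concrete, here's a minimal sketch of the serial version.
The fsync_range() prototype is assumed (modelled on NetBSD's
fd/how/start/length form); nothing below is an existing Linux syscall.

#include <sys/types.h>

/* Assumed prototype, modelled on NetBSD's fsync_range(). */
extern int fsync_range(int fd, int how, off_t start, off_t length);

struct mod { off_t start; off_t length; };

/* One app-level transaction that dirtied n scattered ranges.  Unless
 * the writes were already submitted beforehand (e.g. via fadvise),
 * each call blocks until its range is stable, so the ranges reach the
 * disk in loop order rather than in the order that minimizes seeks. */
int commit_serially(int fd, int how, const struct mod *mods, int n)
{
        for (int i = 0; i < n; i++)
                if (fsync_range(fd, how, mods[i].start,
                                mods[i].length) < 0)
                        return -1;
        return 0;
}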

> > A little?  It's the difference between letting the disk schedule 100
> > scattered writes itself, and forcing the disk to write them in the
> > order you sent them from userspace, aside from the doubling the rate
> > of device commands...
> 
> Again, in the scenario I'm talking about, all the writes were in the Linux 
> I/O queue before the first fsync_range() (thanks to fadvises) , so this 
> doesn't happen.

Maybe you're right about this. :-)
(Persuaded).

An fadvise() which blocks rather overloads the "hint" meaning of
fadvise().  It could work, though.

It smells more like sync_file_range(), where userspace is responsible
for deciding what order to submit the ranges in (because of the
blocking), than fsyncv(), where the kernel uses any heuristic it likes
including knowledge of filesystem block layout (higher level than
elevator, but lower level than plain file offset).
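
For contrast, an fsyncv() interface would hand the kernel the whole
vector in one call.  A sketch of the idea (entirely hypothetical; the
names and types here are made up for illustration):

#include <sys/types.h>

/* Hypothetical interface only -- not an existing syscall. */
struct fsync_range_vec {
        off_t start;    /* byte offset of a dirty range */
        off_t length;   /* length of that range         */
};

/* Sync every listed range of fd before returning.  Because the kernel
 * sees all the ranges up front, it can write them out in whatever
 * order its heuristics prefer (filesystem block layout, elevator
 * position, ...), not the order userspace happened to list them in. */
extern int fsyncv(int fd, const struct fsync_range_vec *vec,
                  int count, int how);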

For userspace, that's not much different from what databases using
O_DIRECT have to do _already_.  _They_ have to decide what order to
submit I/O ranges in, one range at a time, and with AIO they get about
the same amount of block elevator flexibility.  Which is exactly one
full block queue's worth of sorting at the head of a streaming pump of
file offsets.
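
A rough sketch of that pattern with libaio (io_setup() /
io_prep_pwrite() / io_submit() / io_getevents()); the 64-entry window
and 4096-byte alignment are arbitrary choices for illustration:

#include <libaio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

#define QUEUE_DEPTH 64          /* "one block queue's worth"      */
#define BLOCK_SIZE  4096        /* assumed O_DIRECT granularity   */

/* Database-style writer using O_DIRECT + AIO.  Userspace has already
 * chosen the submission order of offsets[] (which must be BLOCK_SIZE
 * aligned for O_DIRECT); the elevator can only re-sort what's in
 * flight at once. */
static int write_window(int fd, const off_t *offsets, int n)
{
        io_context_t ctx = 0;
        struct iocb cbs[QUEUE_DEPTH], *cbp[QUEUE_DEPTH];
        struct io_event evs[QUEUE_DEPTH];
        void *buf;
        int ret = 0;

        if (n > QUEUE_DEPTH)
                n = QUEUE_DEPTH;
        if (io_setup(QUEUE_DEPTH, &ctx) < 0)
                return -1;
        if (posix_memalign(&buf, BLOCK_SIZE, (size_t)n * BLOCK_SIZE)) {
                io_destroy(ctx);
                return -1;
        }
        memset(buf, 0, (size_t)n * BLOCK_SIZE);

        for (int i = 0; i < n; i++) {
                io_prep_pwrite(&cbs[i], fd,
                               (char *)buf + (size_t)i * BLOCK_SIZE,
                               BLOCK_SIZE, offsets[i]);
                cbp[i] = &cbs[i];
        }

        /* Hand the whole window to the kernel at once; this is roughly
         * the amount of reordering freedom the elevator gets. */
        if (io_submit(ctx, n, cbp) != n ||
            io_getevents(ctx, n, n, evs, NULL) != n)
                ret = -1;

        free(buf);
        io_destroy(ctx);
        return ret;
}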

So maybe the fadvise() method is ok...

It does mean two system calls per file range, though.  One fadvise()
per range to submit I/O, one fsync_range() to wait for all of it
afterwards.  That smells like sync_file_range() too.
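
With today's kernel that maps onto sync_file_range() roughly like the
sketch below.  It assumes the ranges are already dirty in the page
cache, and note that sync_file_range() does not flush file metadata or
the drive's write cache, so it isn't an integrity guarantee by itself:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>

struct range { off_t start; off_t length; };

/* "Submit everything, then wait once": one pass starts writeback on
 * every range without blocking, a second pass waits for all of it. */
static int sync_ranges(int fd, const struct range *r, int n)
{
        for (int i = 0; i < n; i++)
                if (sync_file_range(fd, r[i].start, r[i].length,
                                    SYNC_FILE_RANGE_WRITE) < 0)
                        return -1;

        for (int i = 0; i < n; i++)
                if (sync_file_range(fd, r[i].start, r[i].length,
                                    SYNC_FILE_RANGE_WAIT_AFTER) < 0)
                        return -1;
        return 0;
}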

Back to fsyncv() again?  Which does have the benefit of being easy to
understand too :-)

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
