Re: [rfc] fsync_range?

Bryan Henderson wrote:
> > No, why would it block?  The block queue has room for (say) 100 small
> > file ranges.  If you submit 1000 ranges, sure the first 900 may block,
> > then you've got 100 left in the queue.
> 
> Yes, those are the blocks Nick mentioned.  They're the same as with 
> multi-range fsync_range(),
> in which the one system call submits 1000 ranges.

Yes, except that fsync_range() theoretically has flexibility to order
them prior to the block queue with filesystem internal knowledge.  I
doubt if that would ever be implemented, but you never know.

> > Then you call fsync_range() 1000 times, the first 900 are NOPs as you
> > say because the data has been written.  The remaining 100 (size of the
> > block queue) are forced to write serially.  They're even written to
> > the disk platter in order.
> 
> I don't see why they would go serially or in any particular order.

You're right, please ignore my brain fart.

> >Linux has a combined flush+barrier primitive in the block layer.
> >Actually it's not a primitive op, it's a flag on a write meaning "do
> >flush+barrier before and after this write",
> 
> I think you said that wrong, because a barrier isn't something you do. The 
> flag says, "put a barrier before and after this write," and I think you're 
> saying it also implies that as the barrier passes, Linux does a device 
> flush (e.g. SCSI Synchronize Cache) command.  That would make sense as a 
> poor man's way of propagating the barrier into the device.  SCSI devices 
> have barriers too, but they would be harder for Linux to use than a 
> Synchronize Cache, so maybe Linux doesn't yet.

That's all correct.  Linux does a device flush on PATA if the device
write cache is enabled; I don't know if it does one on SCSI.  Two
flushes are done, before and after the flagged write I/O.  There's a
"softbarrier" aspect too: other writes cannot be reordered around
these I/Os, and on devices which accept overlapping commands, the
device queue is drained around the softbarrier.

On PATA that's really all you can do.  On SATA with NCQ, and on SCSI,
if the device accepts enough commands in flight at once, it's cheaper
to disable the device write cache.  It's a balancing act, depending on
how often you flush.  I don't think Linux has ever used the SCSI
barrier capabilities.

One other thing it can do is synchronous write, called FUA on SATA, so
flush+write+flush becomes flush+syncwrite.

The only thing Linux would gain from separating flush ops from barrier
ops in the block request queue, is different reordering and coalescing
opportunities.  It's not permitted to move writes in either direction
around a barrier, but it is permitted to move writes earlier past a
flush, and that may allow flushes to coalesce.

> Yeah, I'm not totally comfortable with that either.  I've been pretty much 
> assuming that all the ranges from this database transaction generally fit 
> in the I/O queue.

I wouldn't assume that.  It's legitimate to write gigabytes of data in
a transaction, then want to fsync it all before writing a commit
block.  Only about 1 x system RAM's worth of dirty data will need
flushing at that point :-)
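
To make that concrete, here's roughly the commit sequence I have in
mind, sketched with Linux's existing sync_file_range() standing in
for the proposed fsync_range().  The fds, ranges and error handling
are all hand-waved, and note sync_file_range() doesn't flush
metadata or the disk write cache, which fsync_range() would:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

struct range { off_t off; off_t len; };

static void commit(int data_fd, int log_fd,
                   const struct range *r, int nranges,
                   const void *commit_rec, size_t commit_len)
{
        /* Push out this transaction's data ranges first. */
        for (int i = 0; i < nranges; i++)
                sync_file_range(data_fd, r[i].off, r[i].len,
                                SYNC_FILE_RANGE_WAIT_BEFORE |
                                SYNC_FILE_RANGE_WRITE |
                                SYNC_FILE_RANGE_WAIT_AFTER);

        /* Only after those writebacks complete, append and flush
         * the commit block. */
        write(log_fd, commit_rec, commit_len);
        fdatasync(log_fd);
}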

> I said in another subthread that I don't think system call overhead is at 
> all noticeable in a program that is doing device-synchronous I/O.  Not 
> enough to justify a fsyncv() like we justify readv().

Btw, historically the justification for readv() was for sockets, not
files.  Separate reads don't work the same.

Yes, system call overhead is quite small.

But I saw recently on the QEMU list that they want to add preadv() and
pwritev() to Linux, because of the difference it makes to performance
compared with a sequence of pread() and pwrite() calls.

That surprises me.  (I wonder if they measured it).
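
For reference, the two patterns they're comparing amount to this
(sketch only; preadv() here is the BSD-style vectored call they're
proposing, and the buffers and offsets are invented):

#define _GNU_SOURCE
#include <sys/uio.h>
#include <unistd.h>

char a[4096], b[4096], c[4096];

void read_both_ways(int fd, off_t off)
{
        /* Three system calls... */
        pread(fd, a, sizeof a, off);
        pread(fd, b, sizeof b, off + 4096);
        pread(fd, c, sizeof c, off + 8192);

        /* ...versus one.  The same data lands in the same buffers;
         * only the per-call overhead differs. */
        struct iovec iov[3] = {
                { a, sizeof a }, { b, sizeof b }, { c, sizeof c },
        };
        preadv(fd, iov, 3, off);
}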

fsync_range() does less work per-page than read/write.  In some
scenarios, fsync_range() is scanning over large numbers of pages as
quickly as possible, skipping the clean+not-writing pages.  I wonder
if that justifies fsyncv() :-)

> Hey, here's a small example of how the flexibility of the single range 
> fadvise plus single range fsync_range beats a multi-range 
> fsyncv/fsync_range:  Early in this thread, we noted the value of feeding 
> multiple ranges of the file to the block layer at once for syncing.  Then 
> we noticed that it would also be useful to feed multiple ranges of 
> multiple files, requiring a different interface.  With the two system 
> calls per range, that latter requirement was already met without thinking 
> about it.

That's why Nick proposed that fsyncv take (file, start, length)
tuples, to sync multiple files :-)
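
Something of roughly this shape (purely hypothetical; all the names
are made up here and none of this exists anywhere today):

#include <sys/types.h>

/* One entry per range to sync; ranges may belong to different files. */
struct fsyncv_range {
        int   fd;      /* file whose range to sync */
        off_t start;   /* offset of the first byte */
        off_t length;  /* length of the range in bytes */
};

/* Sync every listed range.  The point of the vector is that the
 * kernel sees all the ranges up front, so in principle it can sort,
 * batch and parallelise across files and devices before waiting. */
int fsyncv(const struct fsyncv_range *ranges, int nranges);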

If you do it the blocking-fadvise way, the blocking bites.

You'll block while feeding requests for the first file, until you get
started on the second, and so on.  No chance for parallelism -
e.g. what if the files are on different devices in btrfs? :-) (Same
for different extents in the same file actually).
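
Sketching the difference (again with sync_file_range() standing in
for a blocking per-range sync call; the arrays and error handling
are hand-waved):

#define _GNU_SOURCE
#include <fcntl.h>

/* Waits for each range's writeback before even starting the next
 * range's: no chance for the devices to work in parallel. */
void sync_one_at_a_time(int *fds, off_t *off, off_t *len, int n)
{
        for (int i = 0; i < n; i++)
                sync_file_range(fds[i], off[i], len[i],
                                SYNC_FILE_RANGE_WRITE |
                                SYNC_FILE_RANGE_WAIT_AFTER);
}

/* Starts writeback on every range first, then waits, so all the
 * underlying devices can be busy at once. */
void sync_all_then_wait(int *fds, off_t *off, off_t *len, int n)
{
        for (int i = 0; i < n; i++)
                sync_file_range(fds[i], off[i], len[i],
                                SYNC_FILE_RANGE_WRITE);

        for (int i = 0; i < n; i++)
                sync_file_range(fds[i], off[i], len[i],
                                SYNC_FILE_RANGE_WAIT_AFTER);
}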

That said, I'll be very surprised if fsyncv() is implemented smarter
than that.  As an API allowing the possibility, it sort of works
though.  Who knows, it might just pass the work on to btrfs or Tux3 to
optimise cleverly :-)

(That said #2, an AIO-based API would _in principle_ provide yet more
freedom to convey what the app wants without overconstraining.
Overlap, parallelism, and not having to batch things up, but submit
them as needs come in.  Is that realistic?)

-- Jamie