Re: [rfc] fsync_range?

Jamie Lokier <jamie@xxxxxxxxxxxxx> wrote on 01/21/2009 02:30:03 PM:

> > Getting back to I/O scheduled as a result of an fadvise(): if it
> > blocks because the block queue is full, then it's going to block with a
> > multi-range fsync_range() as well.
> 
> No, why would it block?  The block queue has room for (say) 100 small
> file ranges.  If you submit 1000 ranges, sure the first 900 may block,
> then you've got 100 left in the queue.

Yes, that's the blocking Nick mentioned.  It's the same as with a 
multi-range fsync_range(), where the one system call submits all 1000 
ranges.

> Then you call fsync_range() 1000 times, the first 900 are NOPs as you
> say because the data has been written.  The remaining 100 (size of the
> block queue) are forced to write serially.  They're even written to
> the disk platter in order.

I don't see why they would go serially or in any particular order. They're 
in the Linux queue in sorted, coalesced form and go down to the disk in 
batches for the drive to do its own coalescing and ordering.  Same as with 
multi-range fsync_range().  The Linux I/O scheduler isn't going to wait 
for the forthcoming fsync_range() to start any I/O that's in its queue.
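
For concreteness, here's a rough sketch of the submit-then-wait pattern I 
have in mind.  Since fsync_range() doesn't exist yet, I'm using the existing 
sync_file_range() call as a stand-in for both halves (fadvise to submit, 
fsync_range to wait); keep in mind sync_file_range() flushes neither file 
metadata nor the drive's write cache, so this only illustrates the queuing 
behavior:

    #define _GNU_SOURCE
    #include <fcntl.h>

    struct range { off64_t off; off64_t len; };

    /*
     * Submit write-out of every range first, then wait for all of it.
     * The submit pass returns as soon as the pages are queued (it blocks
     * only while the request queue is full), so by the time the wait
     * pass runs, the block layer has had the whole batch to sort and
     * coalesce.
     */
    static int write_ranges(int fd, const struct range *r, int n)
    {
        int i;

        for (i = 0; i < n; i++)
            if (sync_file_range(fd, r[i].off, r[i].len,
                                SYNC_FILE_RANGE_WRITE) < 0)
                return -1;

        for (i = 0; i < n; i++)
            if (sync_file_range(fd, r[i].off, r[i].len,
                                SYNC_FILE_RANGE_WAIT_AFTER) < 0)
                return -1;

        return 0;
    }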

>Linux has a combined flush+barrier primitve in the block layer.
>Actually it's not a primitive op, it's a flag on a write meaning "do
>flush+barrier before and after this write",

I think you said that wrong, because a barrier isn't something you do.  The 
flag says, "put a barrier before and after this write," and I think you're 
saying it also implies that as the barrier passes, Linux issues a device 
cache flush command (e.g. SCSI Synchronize Cache).  That would make sense as 
a poor man's way of propagating the barrier into the device.  SCSI devices 
have barriers too, but they would be harder for Linux to use than a 
Synchronize Cache, so maybe Linux doesn't use them yet.

I can also see that it makes sense for fsync() to use this combination.  I 
was confused before because both the device and the Linux block layer have 
barriers and flushes, and I didn't know which ones we were talking about.

>> Yes, it's the old performance vs integrity issue.  Drives long ago came
>> out with features to defeat operating system integrity efforts, in
>> exchange for performance, by doing write caching by default, ignoring
>> explicit demands to write through, etc.  Obviously, some people want that,
>> but I _have_ seen Linux developers escalate the battle for control of the
>> disk drive.  I can just never remember where it stands at any moment.

> ...

>Forget about "Linux battling for control".  Windows does this barrier
>stuff too, as does every other major OS, and Microsoft documents it in
>some depth.

Not sure what you want me to forget; you seem to be confirming that Linux, 
like every other major OS, is engaged in this battle with disk designers, 
and that seems like a natural state of engineering practice to me.

>fadvise() which blocks is rather overloading the "hint" meaning of
>fadvise().

Yeah, I'm not totally comfortable with that either.  I've been pretty much 
assuming that all the ranges from this database transaction generally fit 
in the I/O queue.

I wonder what the existing fadvise(FADV_DONTNEED) does, since Linux has the 
same "schedule the I/O right now" response to it.  Does it just ignore the 
hint once the queue is full?

>It does mean two system calls per file range, though.  One fadvise()
>per range to submit I/O, one fsync_range() to wait for all of it
>afterwards.  That smells like sync_file_range() too.
>
>Back to fsyncv() again?

I said in another subthread that I don't think system call overhead is at 
all noticeable in a program that is doing device-synchronous I/O.  It's not 
enough to justify an fsyncv() the way we justify readv().

Hey, here's a small example of how the flexibility of single-range fadvise 
plus single-range fsync_range beats a multi-range fsyncv/fsync_range: early 
in this thread, we noted the value of feeding multiple ranges of a file to 
the block layer at once for syncing.  Then we noticed that it would also be 
useful to feed multiple ranges of multiple files, which would require a 
different interface.  With the two system calls per range, that latter 
requirement was already met without our even thinking about it.
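
As a rough sketch (same includes and caveats as the earlier one, and the 
struct is just made up for illustration), the multi-file case falls out for 
free because the fd is simply part of each range:

    /* Hypothetical (fd, offset, length) triples covering several files. */
    struct file_range { int fd; off64_t off; off64_t len; };

    static int sync_transaction(const struct file_range *fr, int n)
    {
        int i;

        for (i = 0; i < n; i++)          /* submit every range, any file */
            sync_file_range(fr[i].fd, fr[i].off, fr[i].len,
                            SYNC_FILE_RANGE_WRITE);

        for (i = 0; i < n; i++)          /* then wait on every range */
            if (sync_file_range(fr[i].fd, fr[i].off, fr[i].len,
                                SYNC_FILE_RANGE_WAIT_AFTER) < 0)
                return -1;

        return 0;
    }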

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Storage Systems

