Jamie Lokier <jamie@xxxxxxxxxxxxx> wrote on 01/21/2009 02:30:03 PM:

> > Getting back to I/O scheduled as a result of an fadvise(): if it blocks
> > because the block queue is full, then it's going to block with a
> > multi-range fsync_range() as well.
>
> No, why would it block?  The block queue has room for (say) 100 small
> file ranges.  If you submit 1000 ranges, sure the first 900 may block,
> then you've got 100 left in the queue.

Yes, those are the blocks Nick mentioned.  They're the same as with a
multi-range fsync_range(), in which the one system call submits 1000
ranges.

> Then you call fsync_range() 1000 times, the first 900 are NOPs as you
> say because the data has been written.  The remaining 100 (size of the
> block queue) are forced to write serially.  They're even written to
> the disk platter in order.

I don't see why they would go serially or in any particular order.
They're in the Linux queue in sorted, coalesced form and go down to the
disk in batches for the drive to do its own coalescing and ordering.
Same as with a multi-range fsync_range().  The Linux I/O scheduler isn't
going to wait for the forthcoming fsync_range() before starting any I/O
that's in its queue.

> Linux has a combined flush+barrier primitive in the block layer.
> Actually it's not a primitive op, it's a flag on a write meaning "do
> flush+barrier before and after this write",

I think you said that wrong, because a barrier isn't something you do.
The flag says, "put a barrier before and after this write," and I think
you're saying it also implies that as the barrier passes, Linux issues a
device flush command (e.g. SCSI Synchronize Cache).  That would make
sense as a poor man's way of propagating the barrier into the device.
SCSI devices have barriers too, but they would be harder for Linux to
use than a Synchronize Cache, so maybe Linux doesn't use them yet.

I can also see that it makes sense for fsync() to use this combination.
I was confused before because both the device and the Linux block layer
have barriers and flushes, and I didn't know which ones we were talking
about.

> > Yes, it's the old performance vs. integrity issue.  Drives long ago
> > came out with features to defeat operating system integrity efforts,
> > in exchange for performance, by doing write caching by default,
> > ignoring explicit demands to write through, etc.  Obviously, some
> > people want that, but I _have_ seen Linux developers escalate the
> > battle for control of the disk drive.  I can just never remember
> > where it stands at any moment.
> ...
> Forget about "Linux battling for control".  Windows does this barrier
> stuff too, as does every other major OS, and Microsoft documents it in
> some depth.

I'm not sure what you want me to forget about; you seem to be confirming
that Linux, like every other major OS, is engaged in this battle with
the disk designers, and that seems like a natural state of engineering
practice to me.

> fadvise() which blocks is rather overloading the "hint" meaning of
> fadvise().

Yeah, I'm not totally comfortable with that either.  I've been pretty
much assuming that all the ranges from this database transaction
generally fit in the I/O queue.  I wonder what the existing
fadvise(FADV_DONTNEED) does, since Linux has the same "schedule the I/O
right now" response to that.  Just ignore the hint once the queue is
full?

> It does mean two system calls per file range, though.  One fadvise()
> per range to submit I/O, one fsync_range() to wait for all of it
> afterwards.  That smells like sync_file_range() too.
>
> Back to fsyncv() again?

I said in another subthread that I don't think system call overhead is
at all noticeable in a program that is doing device-synchronous I/O.
Not enough to justify an fsyncv() the way we justify readv().

Hey, here's a small example of how the flexibility of a single-range
fadvise() plus a single-range fsync_range() beats a multi-range
fsyncv()/fsync_range(): early in this thread, we noted the value of
feeding multiple ranges of a file to the block layer at once for
syncing.  Then we noticed that it would also be useful to feed multiple
ranges of multiple files, which would require a different interface.
With two system calls per range, that latter requirement was already met
without our even thinking about it.
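For concreteness, here's roughly what the two-call, submit-then-wait
pattern looks like with today's sync_file_range() standing in for the
fadvise()/fsync_range() pair (fsync_range() doesn't exist on Linux; the
struct and function names below are just mine for illustration, and I
haven't tested this):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/types.h>

    struct range {
        int     fd;       /* may differ per range: multiple files */
        off64_t offset;
        off64_t nbytes;
    };

    /* Pass 1 starts writeback on every range without waiting, so the
       block layer sees them all at once and can sort and coalesce.
       Pass 2 waits for each range's writeback to complete. */
    static int
    sync_ranges(const struct range *r, int n)
    {
        int i;

        for (i = 0; i < n; i++)                   /* submit the I/O */
            if (sync_file_range(r[i].fd, r[i].offset, r[i].nbytes,
                                SYNC_FILE_RANGE_WRITE) != 0)
                return -1;

        for (i = 0; i < n; i++)                   /* wait for it */
            if (sync_file_range(r[i].fd, r[i].offset, r[i].nbytes,
                                SYNC_FILE_RANGE_WAIT_BEFORE |
                                SYNC_FILE_RANGE_WRITE |
                                SYNC_FILE_RANGE_WAIT_AFTER) != 0)
                return -1;

        return 0;
    }

Note that sync_file_range() makes no promise about the drive's write
cache, so this shows only the shape of the interface, not the integrity
guarantee that fsync() (or a real fsync_range()) would give.  But since
each range carries its own file descriptor, the multiple-files case
above costs nothing extra.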
--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Storage Systems