Nick Piggin wrote:
> > > sync_file_range would have to wait, then write, then wait. It also
> > > does not call into the filesystem's ->fsync function, I don't know
> > > what the wider consequences of that are for all filesystems, but
> > > for some it means that metadata required to read back the data is
> > > not synced properly, and often it means that metadata sync will not
> > > work.
> >
> > fsync_range() must also wait, write, then wait again.
> >
> > The reason is this sequence of events:
> >
> >    1. App calls write() on a page, dirtying it.
> >    2. Data writeout is initiated by usual kernel task.
> >    3. App calls write() on the page again, dirtying it again.
> >    4. App calls fsync_range() on the page.
> >    5. ... Dum de dum, time passes ...
> >    6. Writeout from step 2 completes.
> >
> >    7. fsync_range() initiates another writeout, because the
> >       in-progress writeout from step 2 might not include the changes from
> >       step 3.
> >    8. fsync_range() waits for the writeout from step 7.
> >    9. fsync_range() requests a device cache flush if needed (we hope!).
> >   10. Returns to app.
> >
> > Therefore fsync_range() must wait for in-progress writeout to
> > complete, before initiating more writeout and waiting again.
>
> That's only in rare cases where writeout is started but not completed
> before we last dirty it and before we call the next fsync. I'd say in
> most cases, we won't have to wait (it should often remain clean).

Agreed it's rare. In those cases, sync_file_range() doesn't wait twice
either. Both functions behave the same in this respect.

> > This is the reason sync_file_range() has all those flags. As I said,
> > the man page doesn't really explain how to use it properly.
>
> Well, one can read what the code does. Aside from that extra wait,

There shouldn't be an extra wait.

> and the problem of not syncing metadata,

A bug.

> one thing I dislike about it is that it exposes the new concept of
> "writeout" to the userspace ABI. Previously all we cared about is
> whether something is safe on disk or not. So I think it is
> reasonable to augment the traditional data integrity APIs which will
> probably be more easily used by existing apps.

I agree entirely. Everyone knows what fsync_range() does, just from
the name.

Was there some reason, perhaps for performance or flexibility, for
exposing the "writeout" concept to userspace?

> > Also the kernel is in a better position to decide which order to do
> > everything in, and how best to batch it.
>
> Better position than what? I proposed fsync_range (or fsyncv) to be
> in-kernel too, of course.

I mean the kernel is in a better position than userspace's lame
attempts to call sync_file_range() in a clever way for optimal
performance :-)

> > Also, during the first wait (for in-progress writeout) the kernel
> > could skip ahead to queuing some of the other pages for writeout as
> > long as there is room in the request queue, and come back to the other
> > pages later.
>
> Sure it could. That adds yet more complexity and opens possibility for
> livelock (you go back to the page you were waiting for to find it was
> since redirtied and under writeout again).

Didn't you have a patch that fixes a similar livelock against other
apps in fsync()?

I agree about the complexity. It's probably such a rare case. It must
be handled correctly, though - two waits when needed, one wait usually.
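
For the record, that two-waits-when-needed sequence is exactly what the
sync_file_range() flags spell out. From an application it looks roughly
like this (untested sketch - and note it still doesn't sync metadata or
flush the drive cache, which is the complaint above):

#define _GNU_SOURCE
#include <sys/types.h>
#include <fcntl.h>

/* Sketch only: the data-writeback half of a hypothetical fsync_range(),
 * done with today's sync_file_range(). No metadata, no cache flush, so
 * this is not a data integrity operation by itself. */
static int writeback_range(int fd, off64_t offset, off64_t nbytes)
{
        return sync_file_range(fd, offset, nbytes,
                SYNC_FILE_RANGE_WAIT_BEFORE |  /* wait for writeout already in flight */
                SYNC_FILE_RANGE_WRITE |        /* start writeout of dirty pages in range */
                SYNC_FILE_RANGE_WAIT_AFTER);   /* wait for that writeout to finish */
}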
> > > > For database writes, you typically write a bunch of stuff in various
> > > > regions of a big file (or multiple files), then ideally fdatasync
> > > > some/all of the written ranges - with writes committed to disk in the
> > > > best order determined by the OS and I/O scheduler.
> > >
> > > Do you know which databases do this? It will be nice to ask their
> > > input and see whether it helps them (I presume it is an OSS database
> > > because the "big" ones just use direct IO and manage their own
> > > buffers, right?)
> >
> > I don't know if anyone uses sync_file_range(), or if it even works
> > reliably, since it's not going to get much testing.
>
> The problem is that it is hard to verify. Even if it is getting lots
> of testing, it is not getting enough testing with the block device
> being shut off or throwing errors at exactly the right time.

QEMU would be good for testing this sort of thing, but it doesn't sound
like an easy test to write.

> In 2.6.29 I just fixed a handful of data integrity and error reporting
> bugs in sync that have been there for basically all of 2.6.

Thank you so much! When I started work on a database engine, I cared
about storage integrity a lot. I looked into fsync integrity on Linux
and came out running because the smell was so bad.

> > Take a look at this, though:
> >
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-04/msg00811.html
> >
> > "The results show fadvise + sync_file_range is on par or better than
> > O_DIRECT. Detailed results are attached."
>
> That's not to say fsync would be any worse. And it's just a microbenchmark
> anyway.

In the end he was using O_DIRECT synchronously. You have to overlap
O_DIRECT with AIO (the only time AIO on Linux really works) to get
sensible performance. So ignore that result.

> > By the way, direct I/O is nice but (a) not always possible, and (b)
> > you don't get the integrity barriers, do you?
>
> It should.

O_DIRECT can't do an I/O barrier after every write because performance
would suck. Really badly.

However, a database engine with any self-respect would want I/O barriers
at certain points for data integrity. I suggest fdatasync() et al.
should issue the barrier if there have been any writes, including
O_DIRECT writes, since the last barrier. That could be a single
file-wide flag: "there have been writes since the last barrier".

> > fsync_range would remove those reasons for using separate files,
> > making the database-in-a-single-file implementations more efficient.
> > That is administratively much nicer, imho.
> >
> > Similar for userspace filesystem-in-a-file, which is basically the same.
>
> Although I think a large part is IOPs rather than data throughput,
> so cost of fsync_range often might not be much better.

IOPs are affected by head seeking. If the head is forced to seek between
the journal area and the main data on every serial transaction, IOPs
drop substantially. fsync_range() would reduce that seeking for
databases (and filesystems) which store both in the same file.

> > > > For this, taking a vector of multiple ranges would be nice.
> > > > Alternatively, issuing parallel fsync_range calls from multiple
> > > > threads would approximate the same thing - if (big if) they aren't
> > > > serialised by the kernel.
> > >
> > > I was thinking about doing something like that, but I just wanted to
> > > get basic fsync_range... OTOH, we could do an fsyncv syscall and glibc
> > > could implement fsync_range on top of that?
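
Before I come back to my older suggestion (quoted below), this is
roughly the fsyncv shape I picture - purely hypothetical, nothing here
exists and the names are made up:

/* Hypothetical only: neither the struct nor the syscall exists. */
struct fsync_range_vec {
        off64_t off;    /* start of range to sync */
        off64_t len;    /* bytes; 0 could mean "to end of file" */
};

/* Sync the given byte ranges of one file, plus whatever metadata is
 * needed to read them back; returns 0 or -1/errno like fsync(). glibc
 * could then implement fsync_range(fd, off, len) as a one-element
 * fsyncv(). */
int fsyncv(int fd, const struct fsync_range_vec *ranges,
           unsigned int nranges, unsigned int flags);

The other possible shape - a vector of (fd, off, len) triples - would
also cover the many-files case, but that's the "too vectory" variant.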
> > Rather than fsyncv, is there some way to separate the fsync into parts?
> >
> > 1. A sequence of system calls to designate ranges.
> > 2. A call to say "commit and wait on all those ranges given in step 1".
>
> What's the problem with fsyncv? The problem with your proposal is that
> it takes multiple syscalls and that it requires the kernel to build up
> state over syscalls which is nasty.

I guess I'm coming back to sync_file_range(), which sort of does that
separation :-)

Also, see the other mail about the PostgreSQL folks wanting to sync
multiple files at once optimally, not serialised.

I don't have a problem with fsyncv() per se. Should it take a single
file descriptor and a list of file ranges, or a list of file descriptors
with ranges? The latter is more general, but being too vectory without
justification is a good way to get syscalls NAKd by Linus.

In theory, pluggable Linux-AIO would be a great multiple-request
submission mechanism. There's IOCB_CMD_FDSYNC (an AIO request); just add
IOCB_CMD_FDSYNC_RANGE. There's room under the hood of that API for
batching sensibly, and for putting the waits and barriers in the best
places. But Linux-AIO does not have a reputation for actually working,
though the API looks good in theory.

--
Jamie
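
P.S. For reference, this is roughly what queuing today's whole-file
IOCB_CMD_FDSYNC through Linux-AIO looks like (untested sketch using
libaio, link with -laio; the ranged opcode is the part that would need
inventing):

#include <libaio.h>

/* Submit a single fdatasync-like AIO request and wait for it. */
static int aio_fdsync(int fd)
{
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;
        int ret;

        if (io_setup(1, &ctx) < 0)
                return -1;

        io_prep_fdsync(&cb, fd);        /* kernel sees IOCB_CMD_FDSYNC */
        ret = io_submit(ctx, 1, cbs);
        if (ret == 1)
                ret = io_getevents(ctx, 1, 1, &ev, NULL);

        io_destroy(ctx);
        return (ret == 1 && ev.res == 0) ? 0 : -1;
}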