On Tue, Jan 20, 2009 at 06:31:21PM +0000, Jamie Lokier wrote:
> Nick Piggin wrote:
> > Just wondering if we should add an fsync_range syscall like AIX and
> > some BSDs have? It's pretty simple for the pagecache since it
> > already implements the full sync with range syncs anyway. For
> > filesystems and user programs, I imagine it is a bit easier to
> > convert to fsync_range from fsync rather than use the sync_file_range
> > syscall.
> >
> > Having a flags argument is nice, but AIX seems to use O_SYNC as a
> > flag, I wonder if we should follow?
>
> I like the idea. It's much easier to understand than sync_file_range,
> whose man page doesn't really explain how to use it correctly.
>
> But how is fsync_range different from the sync_file_range syscall with
> all its flags set?

sync_file_range would have to wait, then write, then wait. It also does
not call into the filesystem's ->fsync function. I don't know what the
wider consequences of that are for all filesystems, but for some it
means that metadata required to read back the data is not synced
properly, and often it means that metadata sync will not work.

Filesystems could also much more easily get converted to a ->fsync_range
function if that would be beneficial to any of them.

> For database writes, you typically write a bunch of stuff in various
> regions of a big file (or multiple files), then ideally fdatasync
> some/all of the written ranges - with writes committed to disk in the
> best order determined by the OS and I/O scheduler.

Do you know which databases do this? It would be nice to ask for their
input and see whether it helps them (I presume it is an OSS database
because the "big" ones just use direct IO and manage their own buffers,
right?)

Today, they will have to just fsync the whole file. So they first must
identify which parts of the file need syncing, and then gather those
parts as a vector.

> For this, taking a vector of multiple ranges would be nice.
> Alternatively, issuing parallel fsync_range calls from multiple
> threads would approximate the same thing - if (big if) they aren't
> serialised by the kernel.

I was thinking about doing something like that, but I just wanted to get
basic fsync_range... OTOH, we could do an fsyncv syscall and glibc could
implement fsync_range on top of that?
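
To make the fsyncv idea a bit more concrete, here is a rough userspace
sketch of the semantics, for illustration only. Nothing here is a real
kernel interface: struct sync_range, fsyncv_sketch(), fsync_range_sketch()
and range_write_and_wait() are hypothetical names. The only real calls
used are sync_file_range() and fdatasync(). The sketch assumes Linux and
approximates the proposal by starting writeback on each range and then
falling back to a whole-file fdatasync(), so it can write more than the
requested ranges.

```c
#define _GNU_SOURCE             /* needed for sync_file_range() */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical type: one byte range to sync. Not a kernel interface. */
struct sync_range {
    off_t offset;
    off_t nbytes;
};

/*
 * sync_file_range() with all its flags set: wait, then write, then wait.
 * This does not call the filesystem's ->fsync, so metadata needed to
 * read the data back is not guaranteed to be on disk afterwards.
 */
int range_write_and_wait(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}

/*
 * Hypothetical userspace stand-in for the proposed fsyncv().
 * Pass 1 starts writeback on every range without waiting for it to
 * finish, so the I/O scheduler can see all of the ranges together.
 * Pass 2 is a plain fdatasync(), which waits for the data and goes
 * through the filesystem's ->fsync. It covers the whole file, not
 * just the listed ranges.
 */
int fsyncv_sketch(int fd, const struct sync_range *ranges, size_t count)
{
    assert(ranges != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        if (sync_file_range(fd, ranges[i].offset, ranges[i].nbytes,
                            SYNC_FILE_RANGE_WRITE) < 0)
            return -1;
    }
    return fdatasync(fd);
}

/* fsync_range expressed as a one-element fsyncv. */
int fsync_range_sketch(int fd, off_t offset, off_t nbytes)
{
    struct sync_range r = { offset, nbytes };

    return fsyncv_sketch(fd, &r, 1);
}
```

A real fsync_range or fsyncv in the kernel would sync only the requested
ranges plus the metadata they need, which this fallback cannot do.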