Chris Mason wrote:
> On Wed, 2009-01-21 at 09:12 -0500, Theodore Tso wrote:
> > On Wed, Jan 21, 2009 at 12:37:11PM +0000, Jamie Lokier wrote:
> > >
> > > What about btrfs with data checksums? Doesn't that count among
> > > data-retrieval metadata? What about nilfs, which always writes data
> > > to a new place? Etc.
> > >
> > > I'm wondering what exactly sync_file_range() definitely writes, and
> > > what it doesn't write.
> > >
> > > If it's just in use by Oracle, and nobody's sure what it does, that
> > > smacks of those secret APIs in Windows that made Word run a bit faster
> > > than everyone else's word processor... sort of. :-)
> >
> > Actually, I take that back; Oracle (and most other enterprise
> > databases; the world is not just Oracle --- there's also DB2, for
> > example) generally uses Direct I/O, so I wonder if they are using
> > sync_file_range() at all.
>
> Usually if they don't use O_DIRECT, they use O_SYNC.

There's a case for using both together. An O_DIRECT write can be
converted to a non-direct (buffered) write under some conditions, and
when that happens you want the properties of O_SYNC. This is
documented to happen on some other OSes, and maybe for VxFS on Linux.
Linux is nicer than some other platforms in that it usually returns
EINVAL for O_DIRECT I/O whose alignment isn't satisfactory, but it can
still fall back to buffered I/O in some circumstances. I think current
kernels do a sync in that case, but some earlier 2.6 kernels failed to.

Oh, you'd use O_DSYNC instead, of course... There's no point
committing inode updates all the time, only size increases, and most
OSes document that O_DSYNC does commit size increases.

By the way, emulators/VMs like QEMU and KVM use much the same methods
to access virtual disk images as databases do, for the same reasons.

> > I do wonder though how well or poorly Oracle will work on btrfs, or
> > indeed any filesystem that uses WAFL-like or log-structured
> > filesystem-like algorithms.
> > Most of the enterprise databases have
> > been optimized for use on block devices and filesystems where you do
> > write-in-place accesses; and some enterprise databases do their own
> > data checksumming. So if I had to guess, I suspect the answer to the
> > question I posed is "disastrously". :-)
>
> Yes, I think btrfs' nodatacow option is pretty important for database
> use.

Does O_DIRECT on btrfs still allocate new data blocks?
That's not very direct :-)

I'm thinking that if O_DIRECT is set, considering what's likely to
request it, it may be reasonable for it to mean "overwrite in place"
too (except for files which are actually COW-shared with others, of
course).

> > After all, such db's
> > generally are happiest when the OS acts as a program loader that then
> > gets the heck out of the way of the filesystem, hence their use of
> > DIO.
> >
> > Which again brings me back to the question --- I wonder who is
> > actually using sync_file_range, and what for? I would assume it is
> > some database, most likely; so maybe we should check with MySQL or
> > Postgres?
>
> Eric, didn't you have a magic script for grepping the sources/binaries
> in fedora for syscalls?

sync_file_range does not appear anywhere in

    db-4.7.25
    mysql-dfsg-5.0.67
    postgresql-8.3.5
    sqlite3-3.5.9

(On Ubuntu; presumably the same in other distros.)

-- Jamie
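
For anyone who wants to repeat that kind of check, here is a minimal sketch in Python of one way to search an unpacked source or binary tree for the name sync_file_range. It is not the script asked about above, and the function names (file_contains, find_symbol) are made up for illustration. It only looks for the literal byte string, so a binary that issues the syscall by number would not necessarily show up.

```python
import os


def file_contains(path, needle, chunk_size=1 << 20):
    """Return True if the file at path contains the byte string needle.

    The file is read in chunks; the last len(needle) - 1 bytes of each
    buffer are carried over so a match that straddles two chunks is
    still found.
    """
    keep = len(needle) - 1
    tail = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False
            buf = tail + chunk
            if needle in buf:
                return True
            tail = buf[-keep:] if keep else b""


def find_symbol(root, needle=b"sync_file_range"):
    """Return the sorted paths (relative to root) of files containing needle."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if file_contains(path, needle):
                    hits.append(os.path.relpath(path, root))
            except OSError:
                # Unreadable files and dangling symlinks are skipped.
                continue
    return sorted(hits)
```

Calling find_symbol() on the directory where a package's source was unpacked returns an empty list when the name does not occur anywhere in it. Searching raw bytes rather than decoded text means the same function works on both source files and compiled objects.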