Re: [rfc] fsync_range?

Chris Mason wrote:
> On Wed, 2009-01-21 at 09:12 -0500, Theodore Tso wrote:
> > On Wed, Jan 21, 2009 at 12:37:11PM +0000, Jamie Lokier wrote:
> > > 
> > > What about btrfs with data checksums?  Doesn't that count among
> > > data-retrieval metadata?  What about nilfs, which always writes data
> > > to a new place?  Etc.
> > > 
> > > I'm wondering what exactly sync_file_range() definitely writes, and
> > > what it doesn't write.
> > > 
> > > If it's just in use by Oracle, and nobody's sure what it does, that
> > > smacks of those secret APIs in Windows that made Word run a bit faster
> > > than everyone else's word processor...  sort of. :-)
> > 
> > Actually, I take that back; Oracle (and most other enterprise
> > databases; the world is not just Oracle --- there's also DB2, for
> > example) generally uses Direct I/O, so I wonder if they are using
> > sync_file_range() at all.
> 
> Usually if they don't use O_DIRECT, they use O_SYNC.

There's a case for using both together.

An O_DIRECT write can be converted to a non-direct (buffered) write
under some conditions.  When that happens, you want the properties of
O_SYNC.  This is documented to happen on some other OSes - and maybe
for VxFS on Linux.

Linux is nicer than some other platforms: it usually returns EINVAL
for O_DIRECT I/O whose alignment isn't satisfactory.  But it can still
fall back to buffered I/O in some circumstances.  I think current
kernels do a sync in that case, but some earlier 2.6 kernels failed to.

Oh, you'd use O_DSYNC instead of course...  No point committing inode
updates all the time, only size increases, and most OSes document that
O_DSYNC does commit size increases.
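
To make the "both together" case concrete, here is a minimal Python
sketch.  The helper names open_direct_dsync and write_block and the
4096-byte BLOCK size are assumptions made for illustration; the real
alignment requirement depends on the device and filesystem.  The sketch
drops O_DIRECT only when open() itself refuses it with EINVAL.  It
cannot detect a silent fallback to buffered I/O inside the kernel,
which is the reason for keeping O_DSYNC set in both cases.

```python
import errno
import mmap
import os

BLOCK = 4096  # assumed alignment; the real value depends on device and filesystem


def open_direct_dsync(path):
    """Open path for writing with O_DIRECT|O_DSYNC.

    Returns (fd, used_direct).  If the platform has no O_DIRECT, or the
    filesystem rejects it at open() with EINVAL, O_DSYNC alone is used.
    """
    base = os.O_WRONLY | os.O_CREAT | os.O_DSYNC
    direct = getattr(os, "O_DIRECT", 0)  # not defined on every platform
    try:
        return os.open(path, base | direct, 0o644), bool(direct)
    except OSError as e:
        if e.errno != errno.EINVAL or not direct:
            raise
        return os.open(path, base, 0o644), False


def write_block(fd, offset, payload):
    """Write one zero-padded block of BLOCK bytes at a block-aligned offset."""
    if offset % BLOCK or len(payload) > BLOCK:
        raise ValueError("offset must be aligned and payload must fit one block")
    buf = mmap.mmap(-1, BLOCK)  # anonymous mappings start on a page boundary
    try:
        buf.write(payload)
        return os.pwrite(fd, buf, offset)
    finally:
        buf.close()
```

A caller would open the file once and then issue write_block(fd,
n * BLOCK, data) for each block.  With O_DSYNC set, each write returns
only after the data (and any metadata needed to read it back) has been
handed to stable storage, whether or not the direct path was taken.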

By the way, emulators/VMs like QEMU and KVM use much the same methods
to access virtual disk images as databases do, for the same reasons.

> > I do wonder though how well or poorly Oracle will work on btrfs, or
> > indeed any filesystem that uses WAFL-like or log-structured
> > filesystem-like algorithms.  Most of the enterprise databases have
> > been optimized for use on block devices and filesystems where you do
> > write-in-place accesses; and some enterprise databases do their own
> > data checksumming.  So if I had to guess, I suspect the answer to the
> > question I posed is "disastrously".  :-)
> 
> Yes, I think btrfs' nodatacow option is pretty important for database
> use.

Does O_DIRECT on btrfs still allocate new data blocks?
That's not very direct :-)

I'm thinking if O_DIRECT is set, considering what's likely to request
it, it may be reasonable for it to mean "overwrite in place" too
(except for files which are actually COW-shared with others of course).

> > After all, such db's
> > generally are happiest when the OS acts as a program loader that then
> > gets the heck out of the way of the filesystem, hence their use of
> > DIO.
> > 
> > Which again brings me back to the question --- I wonder who is
> > actually using sync_file_range, and what for?  I would assume it is
> > some database, most likely; so maybe we should check with MySQL or
> > Postgres?
> 
> Eric, didn't you have a magic script for grepping the sources/binaries
> in fedora for syscalls? 

sync_file_range does not appear anywhere in

    db-4.7.25
    mysql-dfsg-5.0.67
    postgresql-8.3.5
    sqlite3-3.5.9

(On Ubuntu; presumably the same in other distros).
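
For anyone wanting to repeat this kind of check on other unpacked
source trees, a minimal Python sketch follows.  It is not necessarily
how the search above was done; the function name find_identifier is
made up for the example, and it does a plain byte-substring match, so
it also catches names such as __NR_sync_file_range.

```python
import os


def find_identifier(root, name):
    """Return sorted paths (relative to root) of files whose bytes contain name."""
    needle = name.encode()
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                with open(path, "rb") as f:
                    if needle in f.read():
                        hits.append(os.path.relpath(path, root))
            except OSError:
                continue  # unreadable file or dangling symlink: skip it
    return sorted(hits)
```

An empty result from find_identifier("postgresql-8.3.5",
"sync_file_range") would correspond to "does not appear anywhere" for
that tree.  It reads each file whole, which is fine for source
packages but not for very large binaries.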

-- Jamie
