Christoph Hellwig wrote:
> On Mon, Apr 21, 2014 at 10:34:18PM +0100, Jamie Lokier wrote:
> > A ranged-fdatasync, for databases with little logs inside the big data
> > file, would be nice.  AIX, NetBSD and FreeBSD all have one :)  Any
> > likelihood of that ever appearing in Linux?  sync_file_range() comes
> > with its Warning in the man page which basically means "don't trust me
> > unless you know the filesystem exactly".
>
> We have the infrastructure for range fsync and fdatasync in the kernel,
> it's just not exposed.  Given that you've already done the research
> how about you send a patch to wire it up?  Do the above implementations
> at least agree on an API for it?

Hi Christoph,

Hardly research, I just did a quick Google and was surprised to find
some results.  The AIX API differs from the BSDs; the BSDs seem to agree
with each other.  They have fsync_range(), with a flag parameter saying
what type of sync, and whether it flushes the storage device write cache
as well (because they couldn't agree that was good - similar to the
barriers debate).

As for me doing it, no, sorry, I haven't touched the kernel in a few
years, life's been complicated for non-technical reasons, and I don't
have time to get back into it now.

> sync_file_range() unfortunately only writes out pagecache data and never
> the needed metadata to actually find it.  While we could multiplex a
> range fsync over it that seems to be very confusing (and would be more
> complicated than just adding new syscalls)

I agree.  I never saw the point in sync_file_range() except to
mislead, whereas fsync_range() always seemed obvious!

In the kernel, I was always under the impression the simple part of
fsync_range - writing out data pages - was solved years ago, but being
sure the filesystem has updated its metadata in the proper way begs for
a little research into what filesystems do when asked, doesn't it?

For example, imagine two dirty pages 0 and 1, two disk blocks A and B,
and a non-overwriting filesystem (similar to btrfs) which knows about
the dirty flags and has formulated a plan to journal a single metadata
change containing two pointers, from [0->A,1->B] to [0->C,1->D] when
it flushes metadata _after_ pages 0 and 1 are written to new disk
blocks C and D.

And you do fsync_range just on page 1.

Now if only page 1 gets written and page 0 does not, it's important
that a different metadata change is journalled: [0->A,1->D] (or just
[1->D]).

Now hopefully, all filesystems are sane enough to just do that, by
calculating what to journal as a response to only data I/O that's in
flight and behind a barrier.

But I wouldn't like to _assume_ that no filesystem's algorithms queue
up the joint [0->C,1->D] metadata change somehow, having seen the dirty
flags, in a way that gets confused by a forced metadata flush after a
partial dirty data flush.  After all it might be a legitimate thing to
do in the current scheme.

(Similar things apply to converting preallocated-but-unwritten regions
to written.)

So I have this weird idea that to do it carefully needs a little
checking what filesystems do with carefully ordered block-pointer
metadata writes.

> > Then there's this, about Linux NFS incoherency with msync() and O_DIRECT:
> >
> >   - https://groups.google.com/d/msg/comp.os.linux.development.apps/B49Rej6KV24/xEouZOVXs9gJ
>
> That mail is utterly confused.  Yes, NFS has less coherency than normal
> filesystems (google for close to open), but msync actually does its
> proper job on NFS.

Good to know :)

-- Jamie
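
The two-page scenario above can be written down as a toy model.  This is
only a sketch of the reasoning, not of any real filesystem: the names
committed, planned, journal_entry() and fsync_range_model() are all
made up for illustration, and "writing" a page is just set membership.
It shows the rule that a ranged sync may only journal pointers for pages
whose data was actually written.

```python
# Toy model of block-pointer metadata in a non-overwriting filesystem.
# All names here are hypothetical; no real filesystem works exactly like this.

# Pointers currently committed on disk: page index -> disk block.
committed = {0: "A", 1: "B"}

# Relocation plan for the dirty pages: page index -> new disk block.
planned = {0: "C", 1: "D"}


def journal_entry(written, planned):
    """Return the pointer updates that are safe to journal.

    Only pages whose data has reached its new block may have their
    pointer moved; everything else must keep the old pointer.
    """
    return {page: planned[page] for page in sorted(written)}


def fsync_range_model(committed, planned, first_page, last_page):
    """Model a ranged sync over pages first_page..last_page (inclusive)."""
    # Step 1: write out only the dirty pages that fall inside the range.
    written = {p for p in planned if first_page <= p <= last_page}

    # Step 2: journal a metadata change covering just those pages,
    # not the joint change that was planned for all dirty pages.
    entry = journal_entry(written, planned)

    # Step 3: the committed mapping changes only for the written pages.
    new_committed = dict(committed)
    new_committed.update(entry)

    # Pages outside the range stay dirty, with their plan intact.
    still_dirty = {p: b for p, b in planned.items() if p not in written}
    return entry, new_committed, still_dirty


entry, after, still_dirty = fsync_range_model(committed, planned, 1, 1)
print(entry)        # {1: 'D'}
print(after)        # {0: 'A', 1: 'D'}
print(still_dirty)  # {0: 'C'}
```

Journalling the joint change [0->C,1->D] at step 2 instead would leave
page 0 pointing at block C, which in this model was never written.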