Re: [PATCH] fsync_range, was: Re: munmap, msync: synchronization

Jamie Lokier <jamie@xxxxxxxxxxxxx> · Wed, 23 Apr 2014 23:15:27 +0100

Christoph Hellwig wrote:
> On Tue, Apr 22, 2014 at 08:04:21AM +0100, Jamie Lokier wrote:
> > Hi Christoph,
> > 
> > Hardly research, I just did a quick Google and was surprised to find
> > some results.  AIX API differs from the BSDs; the BSDs seem to agree
> > with each other. fsync_range(), with a flag parameter saying what type
> > of sync, and whether it flushes the storage device write cache as well
> > (because they couldn't agree that was good - similar to the barriers
> > debate).
> 
> There is no FreeBSD implementation, I think you were confused by FreeBSD
> also hosting NetBSD man pages on their site, just as I initially was.

Yes, especially with the headings on the man pages saying FreeBSD :)
Just checked a FreeBSD 8.2 system, doesn't have it.

> The APIs are mostly the same, except that AIX reuses O_ flags as
> argument and NetBSD has a separate namespace.  Following the latter
> seems more sensible, and also allows developer to define the separate
> name to the O_ flag for portability.
...
> I've cooked up a patch, but I really need someone to test it and promote
> it.  Find the patch attached.  There are two differences to the NetBSD
> one:
> 
>  1) It doesn't fail for read-only FDs.  fsync doesn't, and while
>     standards used to have fdatasync and aio_fsync fail for them,
>     Linux never did and the standards are catching up:
> 
> 	http://austingroupbugs.net/view.php?id=501
> 	http://austingroupbugs.net/view.php?id=671

See also for maybe why:

        http://www.eivanov.com/2011/06/using-fsync-and-fsyncrange-with.html

>  2) I don't implement the FDISKSYNC.  Requiring it is utterly broken,
>     and we wouldn't even have the infrastructure for it.  It might make
>     sense to provide it defined to 0 so that we have the identifier but
>     make it a no-op.

I presume Linux does the equivalent without needing FDISKSYNC, if and
only if the filesystem is mounted with barriers enabled, which is the
default nowadays?

> > In the kernel, I was always under the impression the simple part of
> > fsync_range - writing out data pages - was solved years ago, but being
> > sure the filesystem's updated its metadata in the proper way, that
> > begs for a little research into what filesystems do when asked,
> > doesn't it?
> 
> The filesystems I care about handle it fine, and while I don't know
> the details of others they better handle it properly, given that we
> use vfs_fsync_range to implement O_SNYC/O_DSYNC writes and commits
> from the nfs server.

Excellent.  This really looks like it should have gone in as a system
call years ago, since vfs_fsync_range was there all along waiting to
be used!

> Differences from NetBSD are:
> 
>  1) It doesn't fail for read-only FDs.  fsync doesn't, and while standards
>     used require fdatasync and aio_fsync to fail for read-only file
>     descriptors Linux never did and the standards are catching up:
> 
> 	http://austingroupbugs.net/view.php?id=501
> 	http://austingroupbugs.net/view.php?id=671
> 
>  2) It doesn't implement the FDISKSYNC.  Requiring a flag to actuall make
>     data persistant is completely broken, and the Linux infrastructure
>     doesn't support it anyway.  We could provide it as a no-op if we
>     really need to.

Ah, more differences, which I think should be dropped actually.

   3) Does not implement NetBSD's documented behaviour when length == 0.
      NetBSD says "If the length parameter is zero, fsync_range() will
      synchronize all of the file data".  This path does from offset.

   4) Other weird range stuff inherited from sync_file_range() on 32
      bit machines only.  May not be correct with O_DIRECT or
      filesystems that don't use page cache.

See:

> +static loff_t end_offset(loff_t offset, loff_t nbytes)
> +{
> +	loff_t endbyte = offset + nbytes;
> +
> +	if ((s64)offset < 0)
> +		return -EINVAL;
> +	if ((s64)endbyte < 0)
> +		return -EINVAL;
> +	if (endbyte < offset)
> +		return -EINVAL;
> +
> +	if (sizeof(pgoff_t) == 4) {
> +		if (offset >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
> +			/*
> +			 * The range starts outside a 32 bit machine's
> +			 * pagecache addressing capabilities.  Let it "succeed"
> +			 */
> +			return 0;
> +		}
> +		if (endbyte >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
> +			/*
> +			 * Out to EOF
> +			 */
> +			return LLONG_MAX;
> +		}
> +	}
> +
> +	if (nbytes == 0)
> +		endbyte = LLONG_MAX;
> +	else
> +		endbyte--;		/* inclusive */
> +
> +	return endbyte;
> +}

That was in sync_file_range(), where I think it might have made more
sense as that's obviously tied to the page cache only.  So:

    a) Giving zero length results in sync from offset..LLONG_MAX.
       (NetBSD would have it be 0..LLONG_MAX, according to man page.)

    b) If the offset is "too large" for page cache on a 32-bit machine,
       it won't do anything -- including no metadata side-effects.

    c) If the length is "too large" for page cache on a 32-bit machine,
       it extends the length to LLONG_MAX.

The desired behaviour with zero length, that's obviously a judgement
call.  I guess that provided NetBSD applications the option to use
FDISKSYNC without a range :)

About b) and c) they both look dubious, because it's not a given that
a filesystem is using page cache, or only using page cache.  For
example FUSE using O_DIRECT.  (Not that I've checked if you can
actually write anything in those ranges though.)

b) looks worse because it means side effects are also quietly not
done, and a file might legitimately not use the page cache (consider a
FUSE-mounted file accessed with O_DIRECT).

So, would it not make sense to just check the offset, length and
offset+length fit into s64; and if length is zero change the range to
0..LLONG_MAX, and simply match NetBSD that way?  (Or, call me crazy,
just return if length is zero.)

Best,
-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html