Re: [PATCH] btrfs,vfs: allow FILE_EXTENT_SAME on a file opened ro

Zygo Blaxell <ce3g8jdj@xxxxxxxxxxxxxxxxxxxxx> · Sat, 28 May 2016 20:56:39 -0400

On Sun, May 29, 2016 at 02:21:03AM +0200, Adam Borowski wrote:
> On Fri, May 27, 2016 at 09:59:22PM -0400, Zygo Blaxell wrote:
> > I wonder if there is a risk of damaging files like this:
> > 
> > 	open A O_RDWR
> > 
> > 	open B O_RDONLY
> > 
> > 	copy B to A
> > 
> > 	do _not_ call fsync() here
> > 
> > 	dedup(A, B)
> > 
> > 	do _not_ call fsync() here either
> > 
> > 	some time passes
> > 
> > 	power fail
> > 
> > If A's extent isn't flushed to disk, what happens to B?  Does dedup imply
> > fsync or data ordering such that A is written to disk before the extent
> > ref in B is updated, or can the content of B change?  If root is running
> > dedup, we can assume that whoever authorized root code execution also made
> > sure that this case never arises (e.g. by ensuring the dedup agent always
> > calls fsync, or otherwise ensuring that the extent at A is stable on disk
> > before calling dedup).  If we allow non-root to do this then we have no
> > such assurance--but on the other hand maybe the assurance is weak and
> > bad things can happen here already even if we require root privilege.
> 
> I don't think this can happen on btrfs: the superblock is updated only after
> a barrier when both the data and extent refs are already on the disk.

If and only if the filesystem is mounted with the flushoncommit option,
that's true.  This is not the default, though, and I lost a fair amount
of time and data before I discovered this.

> > I also wonder what happens if an executable that has called
> > mlockall(MCL_FUTURE) gets its pages replaced by a deduped extent while
> > it's running...do the replaced pages become un-mlockall-ed?  Is the VFS
> > smart enough to swap in the new pages?  Am I just silly and insane for
> > expecting shared mmap() to have meaningful real-time properties on btrfs?
> 
> mmap() has currently no relation to extents, even entirely identical files
> don't share any pages:
> 
> dd if=/dev/urandom of=a bs=1048576 count=1024  # 1GB of junk
> cp --reflink a b
> time cat a >/dev/null  # all in page cache -- fast
> time cat b >/dev/null  # every extent is shared with a, but...

That's precisely my point.  If we've called mlockall(MCL_FUTURE), the
executable pages are locked into RAM.  If we then replace the underlying
physical pages, this would normally invalidate the page cache for the
replaced extent; however mlockall(MCL_FUTURE) and the exec() DENYWRITE
lock combined mean that such an invalidation should not be possible.
So what happens?

(now that I think of it, whatever happens would be the same thing that happens
during device delete or balance, so it's probably OK, maybe)

> In any case, this patch doesn't introduce any cases not already triggerable
> by root.

It allows non-root to trigger cases that previously could only be
triggered by root.
Attachment:
signature.asc

Description: Digital signature