Dave Chinner <david@xxxxxxxxxxxxx> wrote:

> On Wed, Jun 28, 2023 at 07:30:50AM +0100, David Howells wrote:
> > Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >
> > > > > Expected behavior:
> > > > > Punching holes in a file after splicing pages out of that file into
> > > > > a pipe should not corrupt the spliced-out pages in the pipe buffer.
> > >
> > > I think this bit is the key.  Why would this be the expected behaviour?
> >
> > As you say, splice is allowed to stuff parts of the pagecache into a pipe
> > and these may get transferred, say, to a network card at the end to
> > transmit directly from.  It's a form of direct I/O.

Actually, it's a form of zerocopy, not direct I/O.

> > If someone has the pages mmapped, they can change the data that will be
> > transmitted; if someone does a write(), they can change that data too.
> > The point of splice is to avoid the copy - but it comes with a tradeoff.
>
> I wouldn't call "post-splice filesystem modifications randomly
> corrupts pipe data" a tradeoff.  I call that a bug.

Would you consider it a kernel bug, then, if you use sendmsg(MSG_ZEROCOPY)
to send some data from a file mmapping that some other userspace then
corrupts by altering the file before the kernel has managed to send it?

Anyway, if you think the splice thing is a bug, we have to fix splice from a
buffered file that is shared-writably mmapped as well as fixing
fallocate()-driven mangling.  There are a number of options:

 (0) Document the bug as a feature: "If this is a problem, don't use splice".

 (1) Always copy the data into the pipe.

 (2) Always unmap and steal the pages from the pagecache, copying if we
     can't.

 (3) R/O-protect any PTEs mapping those pages and implement CoW.

 (4) Disallow splice() from any region that's mmapped; disallow mmap() on,
     or make page_mkwrite wait for, any region that's currently spliced;
     disallow fallocate() on, or make fallocate() wait for, any pages that
     are spliced.

With recent changes, I think there are only two places that need fixing:
filemap_splice_read() and shmem_splice_read().  However, I wonder what the
performance effect of having to do a PTE hunt in splice() will be.

And then there's vmsplice()...

Also, I do wonder what happens if you do MSG_ZEROCOPY to a loopback network
address and then splice out of the other end.  I'm guessing you'll get the
zerocopied pages out into your pipe, as I think it just moves the sent
skbuffs to the receive queue on the other end.

David
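
For concreteness, here is a minimal userspace sketch of the scenario in the
quoted report: splice a pagecache page from a file into a pipe, punch a hole
over the same range, then read the pipe.  The file name and size are
illustrative, error checking is omitted, and whether the pipe read actually
comes back as zeroes depends on the filesystem and kernel version.

/* Sketch: splice a page into a pipe, punch a hole over it, read the pipe. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd, pfd[2];
	off64_t off = 0;
	ssize_t n;

	/* Create a one-page file full of 'A's (name is illustrative). */
	fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
	memset(buf, 'A', sizeof(buf));
	write(fd, buf, sizeof(buf));

	pipe(pfd);

	/* Attach the pagecache page to the pipe without copying it. */
	splice(fd, &off, pfd[1], NULL, sizeof(buf), 0);

	/* Punch a hole over the range that was just spliced. */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  0, sizeof(buf));

	/* If the pipe buffer still refers to the pagecache page, this may
	 * now read back zeroes rather than the 'A's that were spliced.
	 */
	n = read(pfd[0], buf, sizeof(buf));
	printf("read %zd bytes, first byte %#x\n", n, buf[0]);
	return 0;
}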