On Wed, Jun 28, 2023 at 10:33:24AM +0100, David Howells wrote: > Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > On Wed, Jun 28, 2023 at 07:30:50AM +0100, David Howells wrote: > > > Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote: > > > > > > Expected behavior: > > > > > > Punching holes in a file after splicing pages out of that file into > > > > > > a pipe should not corrupt the spliced-out pages in the pipe buffer. > > > > > > I think this bit is the key. Why would this be the expected behaviour? > > > As you say, splice is allowed to stuff parts of the pagecache into a pipe > > > and these may get transferred, say, to a network card at the end to > > > transmit directly from. It's a form of direct I/O. > > Actually, it's a form of zerocopy, not direct I/O. > > > > If someone has the pages mmapped, they can change the data that will be > > > transmitted; if someone does a write(), they can change that data too. > > > The point of splice is to avoid the copy - but it comes with a tradeoff. > > > > I wouldn't call "post-splice filesystem modifications randomly > > corrupts pipe data" a tradeoff. I call that a bug. > > Would you consider it a kernel bug, then, if you use sendmsg(MSG_ZEROCOPY) to > send some data from a file mmapping that some other userspace then corrupts by > altering the file before the kernel has managed to send it? Yes. We had this exact discussion a few months ago w.r.t. OTW checksums being invalid because data was changing mid-checksum. I consider that a bug, and in most cases this is avoided by the hardware checksum offloads. i.e. the checksum doesn't occur until after the data is DMAd to the hardware out of the referenced page so it is immune to tearing caused by 3rd party access to the data. That doesn't mean the data being delivered is valid or correct; fundamentally this is a use-after-free situation. > Anyway, if you think the splice thing is a bug, we have to fix splice from a > buffered file that is shared-writably mmapped as well as fixing > fallocate()-driven mangling. There are a number of options: > > (0) Document the bug as a feature: "If this is a problem, don't use splice". That's the primary issue right now - the behaviour is not documented anywhere. > > (1) Always copy the data into the pipe. We always copy non-pipe data. Only pipes are considered special here... > (2) Always unmap and steal the pages from the pagecache, copying if we can't. That's effectively the definition of SPLICE_F_GIFT behaviour, yes? > (3) R/O-protect any PTEs mapping those pages and implement CoW. Another mechanism for stable pages, but can we do CoW for kernel only mappings easily? > (4) Disallow splice() from any region that's mmapped, disallow mmap() on or > make page_mkwrite wait for any region that's currently spliced. Disallow > fallocate() on or make fallocate() wait for any pages that are spliced. fallocate() is already protected against splice(2) operations via exclusive i_rwsem, mapping->invalidate_lock, DIO drains and internal filesystem locking. The problem is that splicing to a pipe hands page cache pages to a pipe that is not under the control (or is even visible) to the filesystem, so the filesystem can do *nothing* to serialise against these anonymous references that the pipe holds to page caceh pages. So the filesystem would require a page/folio based synchronisation mechanism, kinda like writeback and stable pages, and we'd have to issue that wait everywhere we currently have a folio_wait_stable() IO barrier. fallocate() would also need this, too. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx