Re: copy on write for splice() from file to pipe?

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 10 Feb 2023 17:19:53 +1100

On Thu, Feb 09, 2023 at 08:47:07PM -0800, Linus Torvalds wrote:
> On Thu, Feb 9, 2023 at 8:06 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > So while I was pondering the complexity of this and watching a great
> > big shiny rocket create lots of heat, light and noise, it occurred
> > to me that we already have a mechanism for preventing page cache
> > data from being changed while the folios are under IO:
> > SB_I_STABLE_WRITES and folio_wait_stable().
> 
> No, Dave. Not at all.
> 
> Stop and think.

I have.

> splice() is not some "while under IO" thing. It's *UNBOUNDED*.

Splice has two sides - a source where we splice to the transport
pipe, then a destination where we splice pages from the transport
pipe. For better or worse, time in the transport pipe is unbounded,
but that does not mean the source or destination have unbounded
processing times.

However, unbounded transport time is largely irrelevant, and
focusing on it misses the fact that the application does not require
pages in transit to be stable.

The application we are talking about here is file -> pipe -> network
stack for zero copy sending of static file data and the problem is
that the file pages are not stable whilst they are under IO in the
network stack.

IOWs, the application does not care if the data changes whilst the
pages are in transit, attached to the pipe - it only cares that the
contents are stable once they have been delivered and are now wholly
owned by the network stack IO path so that the OTW encodings
(checksum, encryption, whatever) done within the network IO path
don't get compromised.

i.e. the file pages only need to be stable whilst the network stack
IO path checksums and DMAs the data to the network hardware.
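
To make that window concrete, here is a toy userspace model, not kernel
code. It assumes a CRC32 as the OTW encoding and treats a bytearray as
the page; transmit() and receiver_accepts() are hypothetical names used
only for this sketch.

```python
import zlib


def transmit(page, mutate_between=None):
    """Checksum a buffer, then 'DMA' it, optionally modifying it in between."""
    checksum = zlib.crc32(page)   # step 1: OTW encoding is computed
    if mutate_between is not None:
        mutate_between(page)      # a write() lands on the page mid-IO
    wire = bytes(page)            # step 2: the hardware reads the page
    return checksum, wire


def receiver_accepts(checksum, wire):
    """The receiver recomputes the checksum over what actually arrived."""
    return zlib.crc32(wire) == checksum


def scribble(page):
    """Stand-in for a concurrent file write hitting the same page."""
    page[0:1] = b"j"


# Page left alone between checksum and DMA: the receiver accepts it.
stable_result = receiver_accepts(*transmit(bytearray(b"hello world")))

# Page modified between checksum and DMA: the checksum no longer matches.
unstable_result = receiver_accepts(*transmit(bytearray(b"hello world"), scribble))
```

Only a modification between step 1 and step 2 matters here; changes
before step 1 just mean newer data gets sent.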

That's exactly the same IO context in which the block device stack
requires the page contents to be stable - across parity/checksum
calculations and the subsequent DMA transfers to the storage
hardware.

I'm suggesting that the page should only need to be held stable
whilst it is under IO, whether that IO is in the network stack via
skbs or in the block device stack via bios.  Both network and block
IO are bounded by fixed time limits, both IO paths typically only
need pages held stable for a few milliseconds at a time, and both
have worst case IO times in error situations that are typically
bounded at a few minutes.

IOWs, splice is a complete misdirection here - it doesn't need to
know a thing about stable data requirements at all. It's the
destination processing that requires stable data, not the transport
mechanism.

Hence if we have a generic mechanism that the network stack can use
to detect a file backed page and mark it needing to be stable whilst
the network stack is doing IO on it, everything on the filesystem
side should just work like it does for pages under IO in the block
device stack...
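
As a rough userspace model of that idea (not the kernel implementation,
and not a proposed API): the IO path marks the page as under IO for the
duration of checksum + DMA, and a writer waits until that mark is
cleared, loosely in the spirit of folio_wait_stable(). The Page class
and its begin_io()/end_io()/write() methods are invented for this
sketch.

```python
import threading
import time
import zlib


class Page:
    """Toy page: writers block while any IO holds the page stable."""

    def __init__(self, data):
        self.data = bytearray(data)
        self._cond = threading.Condition()
        self._under_io = 0

    def begin_io(self):
        with self._cond:
            self._under_io += 1

    def end_io(self):
        with self._cond:
            self._under_io -= 1
            if self._under_io == 0:
                self._cond.notify_all()

    def write(self, offset, new_bytes):
        with self._cond:
            # Wait until no IO is in flight, then modify the page.
            self._cond.wait_for(lambda: self._under_io == 0)
            self.data[offset:offset + len(new_bytes)] = new_bytes


def net_send(page, checksummed):
    """Destination-side IO: hold the page stable across checksum and DMA."""
    page.begin_io()
    try:
        checksum = zlib.crc32(page.data)
        checksummed.set()          # the racing writer may try from here on
        time.sleep(0.05)           # bounded IO time, a few ms in practice
        wire = bytes(page.data)    # "DMA" to the hardware
    finally:
        page.end_io()
    return checksum, wire


def demo_stable_send():
    page = Page(b"hello world")
    checksummed = threading.Event()

    def writer():
        checksummed.wait()         # aim the write at the middle of the IO
        page.write(0, b"j")

    t = threading.Thread(target=writer)
    t.start()
    checksum, wire = net_send(page, checksummed)
    t.join()
    return checksum, wire, bytes(page.data)
```

The write is delayed, not lost: the wire data matches its checksum, and
the page carries the new contents once the IO has completed. The pipe
never appears in this model, which is the point.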

Indeed, I suspect that a filesystem -> pipe -> filesystem zero copy
path via splice probably also needs stable source pages for some
filesystems, in which case we need exactly the same mechanism as
we need for stable pages in the network stack zero copy splice
destination path....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx