Re: copy on write for splice() from file to pipe?

Andy Lutomirski <luto@xxxxxxxxxx> · Fri, 10 Feb 2023 11:55:45 -0800

On Fri, Feb 10, 2023 at 11:18 AM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, Feb 10, 2023 at 11:02 AM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> >
> > Second, either make splice more strict or add a new "strict splice"
> > variant.  Strict splice only completes when it can promise that writes
> > to the source that start after strict splice's completion won't change
> > what gets written to the destination.
>
> The thing is, I think your "strict splice" is pointless and wrong.
>
> It's pointless, because it simply means that it won't perform well.
>
> And since the whole point of splice was performance, it's wrong.
>
> I really think the whole "source needs to be stable" is barking up the
> wrong tree.
>
> You are pointing fingers at splice().
>
> And I think that's wrong.
>
> We should point the fingers at either the _user_ of splice - as Jeremy
> Allison has done a couple of times - or we should point it at the sink
> that cannot deal with unstable sources.
>
> Because that whole "source is unstable" is what allows for that higher
> performance. The moment you start requiring stability, you _will_ lose
> it. You will have to lock the page, you'll have to unmap it from any
> shared mappings, etc etc.  And even if there are no writers, or no
> current mappers, all that effort to make sure that is the case is
> actually fairly expensive.

...

> Because I really think that your "strict splice" model would just mean
> that now the kernel would have to add not just a memcpy, but also a
> new allocation for that new stable buffer for the memcpy, and that
> would all just be very very pointless.
>
> Alternatively, it would require some kind of nasty hard locking
> together with other limitations on what can be done by non-splice
> users.

I could be wrong, but I don't think any of this is necessary.  My
strict splice isn't intended to be any more stable than current splice
-- it's intended to complete more slowly and more informatively.  Now
maybe I'm wrong and the implementation would be nasty, but I think that
the only bookkeeping needed is to arrange strict-splice to not
complete until the kernel is done with the source's page cache.  The
use of the source is refcounted already, and a bit of extra work might
be needed to track which strict-splice the reference came from, but
unless I've missed something, it's not crazy.

Looking at the current splice implementation, a splice that isn't
"strictly completed" is sort of represented by a struct pipe_buffer (I
think).  The actual implementation of strict-splice might consist of
separating pipe_buffer out from a pipe and adding an io_kiocb* and a
refcount to it.  Or maybe even just adding an io_kiocb* and making the
existing refcounting also keep track of the io_kiocb*, but that might be
complicated.  This all boils down to tracking an actual splice all the
way through its lifecycle and not reporting it as done until it's all
the way done.  Anything else is icing on the cake, no?
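
To make the bookkeeping concrete, here is a rough userspace model of
the idea.  It is not kernel code: struct strict_splice stands in for
the io_kiocb*, struct model_pipe_buffer stands in for pipe_buffer, and
every name and helper below is made up for illustration.  The only
point is that completion is reported once the last reference to the
source pages is dropped, with no locking or copying of the pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the io_kiocb*: one per strict splice. */
struct strict_splice {
    int  pending;    /* buffers that still reference source pages */
    bool submitted;  /* every buffer has been queued to the pipe */
    bool completed;  /* completion has been reported to the caller */
};

/* Hypothetical stand-in for pipe_buffer with an owner pointer added. */
struct model_pipe_buffer {
    int refs;                     /* models the existing refcount */
    struct strict_splice *owner;  /* the added back-pointer */
};

void strict_splice_init(struct strict_splice *s)
{
    s->pending = 0;
    s->submitted = false;
    s->completed = false;
}

/* Report completion only when nothing references the source any more. */
void strict_splice_maybe_complete(struct strict_splice *s)
{
    if (s->submitted && s->pending == 0)
        s->completed = true;
}

/* Queue one buffer that references source page cache. */
void buf_attach(struct model_pipe_buffer *b, struct strict_splice *s)
{
    b->refs = 1;
    b->owner = s;
    s->pending++;
}

/* An extra holder of the same pages, e.g. a sink that keeps them. */
void buf_get(struct model_pipe_buffer *b)
{
    b->refs++;
}

/* Drop one reference; the last drop releases the splice's claim. */
void buf_put(struct model_pipe_buffer *b)
{
    assert(b->refs > 0);
    if (--b->refs == 0) {
        struct strict_splice *s = b->owner;

        b->owner = NULL;
        s->pending--;
        strict_splice_maybe_complete(s);
    }
}

/* The splice call itself has finished queueing buffers. */
void strict_splice_submitted(struct strict_splice *s)
{
    s->submitted = true;
    strict_splice_maybe_complete(s);
}
```

In this model a caller attaches buffers, calls
strict_splice_submitted(), and sees completed become true only after
the final buf_put(), however long the sink holds on to the pages.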

There is absolutely no need to lock files or make page-cache pages
immutable or anything like that.

I think this is almost exactly what Jeremy and Stefan are asking for
re: notification when the system is done with a zero-copy send:

> What might be helpful in addition would be some kind of
> notification that all pages are no longer used by the network
> layer, IORING_OP_SENDMSG_ZC already supports such a notification,
> maybe we can build something similar.
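
For reference, my understanding of the existing zero-copy send
notification is that a request can post two CQEs: the first carries
the result and has IORING_CQE_F_MORE set if a notification will
follow, and the second has IORING_CQE_F_NOTIF set once the kernel is
done with the buffer.  A small sketch of how userspace might track
that follows.  The SKETCH_* constants are local copies of the uapi
flag bits so the sketch builds without <linux/io_uring.h>, and the
struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the io_uring CQE flag bits (assumed to match uapi). */
#define SKETCH_CQE_F_MORE  (1U << 1)  /* mirrors IORING_CQE_F_MORE */
#define SKETCH_CQE_F_NOTIF (1U << 3)  /* mirrors IORING_CQE_F_NOTIF */

/* Hypothetical per-request state kept by the application. */
struct zc_send_state {
    bool    result_seen;      /* first CQE (send result) has arrived */
    bool    buffer_released;  /* safe to modify or reuse the buffer */
    int32_t result;           /* bytes sent, or a negative errno */
};

/* Feed each CQE belonging to one zero-copy send into this function. */
void zc_send_on_cqe(struct zc_send_state *st, int32_t res, uint32_t flags)
{
    if (flags & SKETCH_CQE_F_NOTIF) {
        /* Notification CQE: the kernel no longer uses the buffer. */
        st->buffer_released = true;
        return;
    }

    /* Result CQE: record the outcome of the send itself. */
    st->result_seen = true;
    st->result = res;

    /* Without F_MORE no notification follows, so release right away. */
    if (!(flags & SKETCH_CQE_F_MORE))
        st->buffer_released = true;
}
```

A strict splice would presumably want the same shape: one event for
"the data has been queued" and a later one for "the source pages are
no longer referenced".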