On 03/11/2019 02.14, Andy Lutomirski wrote:
On Sat, Nov 2, 2019 at 4:10 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
On Sat, Nov 2, 2019 at 4:02 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
But I don't think anybody actually _did_ any of that. But that's
basically the argument for the three splice operations:
write/vmsplice/splice(). Which one you use depends on the lifetime and
the source of your data. write() is obviously for the copy case (the
source data might not be stable), while splice() is for the "data from
another source", and vmsplice() is "data is from stable data in my
vm".
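
As a rough user-space illustration of the three cases above, the sketch
below wraps each call for a pipe destination. The helper names
(send_copy, send_vm, send_from_fd) are made up for this example; only
write(), vmsplice() and splice() themselves are real interfaces.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Copy case: the kernel copies buf, so the caller may reuse it at once. */
static ssize_t send_copy(int pipe_wr, const void *buf, size_t len)
{
    return write(pipe_wr, buf, len);
}

/*
 * "Stable data in my vm": the pipe may end up referencing the caller's
 * pages, so buf should not be modified until the reader has consumed it.
 */
static ssize_t send_vm(int pipe_wr, void *buf, size_t len)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };

    return vmsplice(pipe_wr, &iov, 1, 0);
}

/*
 * "Data from another source": move bytes from src_fd into the pipe
 * without bouncing them through a user-space buffer.
 */
static ssize_t send_from_fd(int src_fd, int pipe_wr, size_t len)
{
    return splice(src_fd, NULL, pipe_wr, NULL, len, 0);
}
```

Only send_copy() leaves the caller free to scribble on the buffer right
after the call returns; with the other two the lifetime of the source
matters.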
Btw, it's really worth noting that "splice()" and friends are from a
more happy-go-lucky time when we were experimenting with new
interfaces, and in a day and age when people thought that interfaces
like "sendpage()" and zero-copy and playing games with the VM was a
great thing to do.
I suppose a nicer interface might be:
madvise(buf, len, MADV_STABILIZE);
(MADV_STABILIZE is an imaginary operation that write protects the
memory a la fork() but without the copying part.)
vmsplice_safer(fd, ...);
Where vmsplice_safer() is like vmsplice, except that it only works on
write-protected pages. If you vmsplice_safer() some memory and then
write to the memory, the pipe keeps the old copy.
But this can all be done with memfd and splice, too, I think.
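
One possible reading of the memfd-plus-splice idea, as a minimal sketch:
put the data in a memfd, seal it against further writes so the contents
are stable, then splice() it into the pipe. The helper names
(make_stable_memfd, splice_memfd) and the "stable-buf" label are
invented for illustration, and the write() into the memfd is itself a
copy; a real user would presumably generate the data in the memfd
directly.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy buf into a fresh memfd and seal it so the
 * contents can no longer be resized or written. Returns the fd or -1.
 */
static int make_stable_memfd(const void *buf, size_t len)
{
    int fd = memfd_create("stable-buf", MFD_CLOEXEC | MFD_ALLOW_SEALING);

    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len ||
        fcntl(fd, F_ADD_SEALS,
              F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: splice the first len bytes of memfd into a pipe. */
static ssize_t splice_memfd(int memfd, int pipe_wr, size_t len)
{
    off64_t off = 0; /* explicit offset, so the fd's file position is untouched */

    return splice(memfd, &off, pipe_wr, NULL, len, 0);
}
```

Once F_SEAL_WRITE is set, later write() calls on the memfd fail, which
approximates the "write protected" state that the imaginary
MADV_STABILIZE would give an ordinary buffer.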
Looks monstrous. This will kill all fun and profit. =)
I think vmsplice should at least deprecate and ignore SPLICE_F_GIFT.
It almost never works: if the page is still mapped, then page_count in
generic_pipe_buf_steal() will be at least 2 (pte and pipe gup).
But if the user munmaps the vma between splicing and consuming (and the
page is not stuck in lazy tlb and per-cpu vectors), then a page from the
anon lru could be spliced into a file. Ouch.
And it looks like the fuse device still accepts SPLICE_F_MOVE.
It turns out that VM games are almost always more expensive than just
copying the data in the first place, but hey, people didn't know that,
and zero-copy was seen as a big deal.
The reality is that almost nobody uses splice and vmsplice at all, and
they have been a much bigger headache than they are worth. If I could
go back in time and not do them, I would. But there have been a few
very special uses that seem to actually like the interfaces.
But it's entirely possible that we should kill vmsplice() (likely by
just implementing the semantics as "write()") because it's not common
enough to have the complexity.
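
Seen from user space, "vmsplice() with write() semantics" would behave
roughly like writev() on the pipe. The wrapper below is only a sketch of
that behaviour, not kernel code; vmsplice_copying is a hypothetical name,
and ignoring the flags argument is an assumption about how such a change
might treat SPLICE_F_GIFT and friends.

```c
#define _GNU_SOURCE
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for a copying vmsplice(): same argument shape,
 * but the data is copied into the pipe, so the caller may modify or
 * free the buffers as soon as the call returns.
 */
static ssize_t vmsplice_copying(int pipe_wr, const struct iovec *iov,
                                unsigned long nr_segs, unsigned int flags)
{
    (void)flags; /* assumed: gift/move style flags are simply ignored */
    return writev(pipe_wr, iov, (int)nr_segs);
}
```

With this behaviour, overwriting the buffer after the call does not
change what the reader sees, which is the property the zero-copy
variant cannot promise.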
I think this is the right choice.
FWIW, the openssl vmsplice() call looks dubious, but I suspect it's
okay because it's vmsplicing to a netlink socket, and the kernel code
on the other end won't read the data after it returns a response.
--Andy