On Fri, Feb 10, 2023 at 8:34 AM Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, Feb 10, 2023 at 7:15 AM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> >
> > Frankly, I really don't like having non-immutable data in a pipe.
>
> That statement is completely nonsensical.

I know what splice() is. I'm trying to make the point that it may not be the right API for most (all?) of its use cases, that we can maybe do better, and that we should maybe even consider deprecating (and simplifying, at the cost of performance) splice in the moderately near future. And I think I agree with you on most of what you're saying.

> It was literally designed to be "look, we want zero-copy networking,
> and we could do 'sendfile()' by mmap'ing the file, but mmap - and
> particularly munmap - is too expensive, so we map things into kernel
> buffers instead".

Indeed. mmap() + sendfile() + munmap() is extraordinarily expensive and is not the right solution to zero-copy networking.

> So saying "I really don't like having non-immutable data in a pipe" is
> complete nonsense. It's syntactically correct English, but it makes no
> conceptual sense.
>
> You can say "I don't like 'splice()'". That's fine. I used to think
> splice was a really cool concept, but I kind of hate it these days.
> Not liking splice() makes a ton of sense.
>
> But given splice, saying "I don't like non-immutable data" really is
> complete nonsense.

I am saying exactly what I meant. Obviously mutable data exists. I'm saying that *putting it in a pipe* *while it's still mutable* is not good. Which implies that I don't think splice() is good. No offense. I am *not* saying that the mere existence of mutable data is a problem.

> That's not something specific to "splice()". It's fundamental to the
> whole *concept* of zero-copy. If you don't want copies, and the source
> file changes, then you see those changes.

Of course! A user program copying data from a file to a network fundamentally does this:

Step 1: start the process.

Step 2: data goes out to the actual wire or a buffer on the NIC or is otherwise in a place other than the page cache, and the kernel reports completion.

There are many ways to make this happen. Step 1 could be starting read() and step 2 could be send() returning. Step 1 could be sticking something in an io_uring queue and step 2 could be reporting completion. Step 1 could be splice()ing to a pipe and step 2 could be a splice from the pipe to a socket completing (and maybe even later, when the data actually goes out). *Obviously* any change to the file between steps 1 and 2 may change the data that goes out the wire.

> So the data lifetime - even just on just one side - can _easily_ be
> "multiple seconds" even when things are normal, and if you have actual
> network connectivity issues we are easily talking minutes.

True. But splice is extra nasty: step 1 happens potentially arbitrarily long before step 2, and the kernel doesn't even know which socket the data is destined for in step 1. So step 1 can't usefully return -EWOULDBLOCK, for example. And it's awkward for the kernel to report errors, because steps 1 and 2 are so disconnected. And I'm not convinced there's any corresponding benefit.

In any case, maybe io_uring gives an opportunity to do much better. io_uring makes it *efficient* for largish numbers of long-running operations to all be pending at once.
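For concreteness, here is roughly the two-step dance as it exists today, using plain splice(2). This is only a sketch: error handling and short-transfer loops are omitted, and send_file_range() is just a made-up helper name.

/*
 * Sketch only: file -> pipe (step 1), then pipe -> socket (step 2).
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static int send_file_range(int file_fd, int sock_fd, off_t off, size_t len)
{
	int pipefd[2];
	ssize_t moved;

	if (pipe(pipefd) < 0)
		return -1;

	/*
	 * Step 1: the file's pages get attached to the pipe.  At this point
	 * the kernel has no idea which socket (if any) they are destined
	 * for, so it can't usefully say "would block" or report a send
	 * error here.
	 */
	moved = splice(file_fd, &off, pipefd[1], NULL, len, SPLICE_F_MOVE);
	if (moved < 0)
		goto out;

	/*
	 * Step 2 (sort of): the same pages are handed to the socket.  Even
	 * after this returns, the data may sit in the socket's queue for a
	 * long time, and writes to the file can still change what goes out
	 * on the wire.
	 */
	moved = splice(pipefd[0], NULL, sock_fd, NULL, moved,
		       SPLICE_F_MOVE | SPLICE_F_MORE);
out:
	close(pipefd[0]);
	close(pipefd[1]);
	return moved < 0 ? -1 : 0;
}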
Would an API like this work better (very handwavy -- I make absolutely no promises that this is compatible with existing users -- new opcodes might be needed):

Submit IORING_OP_SPLICE from a *file* to a socket: this tells the kernel to kindly send data from the file in question to the network. Writes to the file before submission will be reflected in the data sent. Writes after submission may or may not be reflected. (This is step 1 above.)

The operation completes (and is reported in the CQ) only after the kernel knows that the data has been snapshotted (step 2 above). So completion can be reported when the data is DMAed out, or when it's checksummed-and-copied, or if the kernel decides to copy it for any other reason, *and* the kernel knows that it won't need to read the data again for possible retransmission. As you said, this could easily take minutes, but that seems maybe okay to me. (And if Samba needs to make sure that future writes don't change the outgoing data even two seconds later, when the data has been sent but not acked, then maybe a fancy API could be added to help, or maybe Samba shouldn't be using zero-copy IO in the first place!)

If the file is truncated or some other problem happens, the operation can fail.

I don't know how easy or hard this is to implement, but it seems like it would be quite pleasant to *use* from user code, it ought to be even faster than splice-to-pipe-then-splice-to-socket (simply because there is less bookkeeping), and it doesn't seem like any file change tracking would be needed in the kernel.

If this works and becomes popular enough, splice-from-file-to-pipe could *maybe* be replaced in the kernel with a plain copy.

--Andy
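P.S. To make the shape of the proposal concrete, here is a very rough userspace sketch written against liburing. Today IORING_OP_SPLICE still requires one end to be a pipe, so the direct file-to-socket submission below is precisely the hypothetical part (it might well end up as a new opcode instead), and send_file_direct() plus the completion semantics described in the comments are made up for illustration.

/*
 * Hypothetical!  Current kernels require one side of IORING_OP_SPLICE to
 * be a pipe, so this exact submission would fail today.
 */
#include <liburing.h>
#include <stddef.h>

static int send_file_direct(int file_fd, int sock_fd, size_t len)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int ret;

	ret = io_uring_queue_init(8, &ring, 0);
	if (ret < 0)
		return ret;

	/*
	 * Step 1: ask the kernel to send bytes 0..len of the file to the
	 * socket.  Writes to the file before this point are included;
	 * writes after it may or may not be.
	 */
	sqe = io_uring_get_sqe(&ring);	/* fresh ring, so this won't be NULL */
	io_uring_prep_splice(sqe, file_fd, 0, sock_fd, -1, len, 0);
	io_uring_submit(&ring);

	/*
	 * Step 2: the completion shows up only once the data has been
	 * snapshotted (DMAed out, or checksummed-and-copied) and the kernel
	 * knows it won't re-read the file for retransmission.  That can
	 * legitimately take a long time.
	 */
	ret = io_uring_wait_cqe(&ring, &cqe);
	if (ret == 0) {
		ret = cqe->res;	/* bytes sent, or -errno (e.g. truncation) */
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	return ret;
}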