Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote: > On Wed, Jun 28, 2023 at 10:33:24AM +0100, David Howells wrote: > > Would you consider it a kernel bug, then, if you use sendmsg(MSG_ZEROCOPY) to > > send some data from a file mmapping that some other userspace then corrupts by > > altering the file before the kernel has managed to send it? > > I think there's a difference in that sendmsg() will block until it > returns (... right? Nope - sendmsg returns once the data is in the Tx queue. It tells you about the progress made in transmitting your zerocopy data by pasting error-report messages with SO_EE_ORIGIN_ZEROCOPY set into the msg_control buffer when you call recvmsg(). That tells you how many of the bytes you sent with MSG_ZEROCOPY have so far been transmitted. You're expected to work out from that which of your buffers are now freed up. Further, even if the process that issued the sendmsg() might be blocked, it doesn't mean that some other process can modify the contents of the buffer by write() or writing through a shared-writable mmap. > splice() returns. That implies the copy is done. I thought the whole point of splice was to *avoid* the copy. Or is it only to avoid user<->kernel copies? > Then the same thread modifies the file, but the pipe sees the new data, not > the old. Doesn't that feel like a bug to you? It can be argued either way. It you want to see it as a bug, then what solution do you want? (1) Always copy the data into the pipe. (2) Always unmap and steal the pages from the pagecache, copying if we can't. (3) R/O-protect any PTEs mapping those pages and implement CoW. (4) Disallow splice() from any region that's mmapped, disallow mmap() on or make page_mkwrite wait for any region that's currently spliced. Disallow fallocate() on or make fallocate() wait for any pages that are spliced. 2 and 3 are likely to have a performance degradation because we'll have to zap/modify PTEs before completing the splice. 4 will have some functionality degradation. 3 sounds particularly messy. The problem is that if a page in the pagecache is spliced into a pipe, we have to discard the page from the pagecache if we would otherwise want to modify it. I guess we'd need a splice counter adding to struct folio? (Or as least a PG_was_once_spliced bit). Maybe (2a) Steal the page if unmapped, unpinned and only in use by the pagecache, copy it otherwise? Maybe you have a better solution? David