On Tuesday, 27 June 2023 06:31:56 EDT, Matthew Wilcox wrote:
Perhaps the problem is that splice() appears to copy, but really just takes the reference. Perhaps splice needs to actually copy if it sees a multi-page folio and isn't going to take all of it. I'm not an expert in splice-ology, so let's cc some people who know more about splice than I do.
As a naive userspace coder, my mental model of splice is that it takes references to the pages backing the input file/pipe and appends those references to the output pipe's queue. Crucially, my expectation is that the cloned pages are marked copy-on-write, such that if some other process subsequently modifies the pages, then those pages will be copied at the time of modification (i.e., copy-on-write), and the references in the spliced-into pipe buffer will still refer to the unmodified originals.
If splice is merely taking a reference without caring to implement copy-on-write semantics, then it's not just FALLOC_FL_PUNCH_HOLE that will cause problems. Indeed, *any* write to the input file in the range where pages were spliced into a pipe will impact the pages already in the pipe buffer. I'm sure that makes perfect sense to a kernel developer, but it violates the implied contract of splice, which is to act *as though* the pages had been *copied* into the pipe.
Obviously, I'm not asking for splice to actually copy the pages, as zero-copy semantics are a large part of the motivation for preferring splice in the first place, but I do believe that the pages should be made copy-on-write such that subsequent writes to those pages in the page cache, whether by write or mmap or fallocate or otherwise, will force a copy of the pages to be made if the copy-on-write reference count of the modified pages is non-zero. Note, I say, "copy-on-write reference count," as distinct from the ordinary reference count, as you would need to properly handle the case where there are multiple *shared* references to a page that *should* observe any changes made to it, while there may also be multiple *private* references to the page that *must not* observe any changes made to it.
Just a thought: if you're going to implement bi-level reference counting semantics, then you could make MMAP_PRIVATE actually take a snapshot of the mapped pages at the time of mapping. Currently, mmap(2) says, "It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region," so you have the latitude to make the private mapping truly isolated from subsequent changes in the backing file, using the same plumbing as you could use to isolate spliced pages.