On Tue, Jun 27, 2023 at 03:47:57PM +1000, Dave Chinner wrote:
> On Mon, Jun 26, 2023 at 09:12:52PM -0400, Matt Whitlock wrote:
> > Hello, all. I am experiencing a data corruption issue on Linux
> > 6.1.24 when calling fallocate with FALLOC_FL_PUNCH_HOLE to punch
> > out pages that have just been spliced into a pipe. It appears that
> > the fallocate call can zero out the pages that are sitting in the
> > pipe buffer, before those pages are read from the pipe.
> >
> > Simplified code excerpt (eliding error checking):
> >
> > int fd = /* open file descriptor referring to some disk file */;
> > for (off_t consumed = 0;;) {
> >     ssize_t n = splice(fd, NULL, STDOUT_FILENO, NULL, SIZE_MAX, 0);
> >     if (n <= 0) break;
> >     consumed += n;
> >     fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, consumed);
> > }
>
> Huh. Never seen that pattern before - what are you trying to
> implement with this?
>
> > Expected behavior:
> > Punching holes in a file after splicing pages out of that file
> > into a pipe should not corrupt the spliced-out pages in the pipe
> > buffer.
>
> splice is a nasty, tricky beast that should never have been
> inflicted on the world...
>
> > Observed behavior:
> > Some of the pages that have been spliced into the pipe get zeroed
> > out by the subsequent fallocate call before they can be consumed
> > from the read side of the pipe.
>
> Which implies the splice is not copying the page cache pages but
> simply taking a reference to them.
>
> > Steps to reproduce:
> >
> > 1. Save the attached ones.c, dontneed.c, and consume.c.
> >
> > 2. gcc -o ones ones.c
> >    gcc -o dontneed dontneed.c
> >    gcc -o consume consume.c
> >
> > 3. Fill a file with 32 MiB of 0xFF:
> >    ./ones | head -c$((1<<25)) >testfile
> >
> > 4. Evict the pages of the file from the page cache:
> >    sync testfile && ./dontneed testfile
>
> To save everyone some time, this one-liner:
>
> # xfs_io -ft -c "pwrite -S 0xff 0 32M" -c fsync -c "fadvise -d 0 32M" testfile
>
> does the same thing as steps 3 and 4. I also reproduced it with much
> smaller files - 1MB is large enough to see multiple corruption
> events every time I've run it.
>
> > 5. Splice the file into a pipe, punching out batches of pages
> >    after splicing them:
> >    ./consume testfile | hexdump -C
> >
> > The expected output from hexdump should show 32 MiB of 0xFF.
> > Indeed, on my system, if I omit the POSIX_FADV_DONTNEED advice,
> > then I do get the expected output. However, if the pages of the
> > file are not already present in the page cache (i.e., if the
> > splice call faults them in from disk), then the hexdump output
> > shows some pages full of 0xFF and some pages full of 0x00.
>
> Like so:
> 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 01b0a000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
> *
> 01b10000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 01b12000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
> *
> 01b18000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 01b19000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
> *
> 01b20000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 01b22000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
> *
> 01b24000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 01b25000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
> *
> 01b28000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 01b29000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
> *
> 01b2c000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>
> Hmmm. The corruption, more often than not, starts on a high-order
> aligned file offset. Tracing indicates data is being populated in
> the page cache by readahead, which would be using high-order folios
> in XFS.
>
> All the splice operations are returning byte counts that are 4kB
> aligned, so the punch is doing filesystem block aligned punches. The
> extent freeing traces indicate the filesystem is removing exactly
> the right ranges from the file, and so the page cache invalidation
> calls it is doing are also going to be for the correct ranges.
>
> This smells of a partial high-order folio invalidation problem,
> or at least a problem with splice working on pages rather than
> folios, with the two not being properly coherent as a result of
> partial folio invalidation.
>
> To confirm, I removed all the mapping_set_large_folios() calls in
> XFS, and the data corruption goes away. Hence, at minimum, large
> folios look like a trigger for the problem.
>
> Willy, over to you.
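A minimal sketch of the consume.c-style loop, reconstructed from the
simplified excerpt quoted above with error checking added. The attached
consume.c is the authoritative reproducer; the open() flags and the
error handling here are assumptions, and it is intended to be run as
"./consume testfile | hexdump -C" per step 5:

/*
 * Hedged sketch of the splice-then-punch reproducer loop described
 * above.  Not the original consume.c; names and error handling are
 * illustrative.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (off_t consumed = 0;;) {
		/* move file pages into the stdout pipe without copying */
		ssize_t n = splice(fd, NULL, STDOUT_FILENO, NULL, SIZE_MAX, 0);
		if (n < 0) {
			perror("splice");
			return 1;
		}
		if (n == 0)
			break;
		consumed += n;

		/* punch out everything spliced so far, keeping the file size */
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      0, consumed) < 0) {
			perror("fallocate");
			return 1;
		}
	}
	return 0;
}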
Just to follow up: splice ends up in the iov_iter code with ITER_PIPE
as the destination. We then end up in copy_page_to_iter(), which, if
the iter is a pipe, hands off to copy_page_to_iter_pipe(). This does
not copy the page contents; it simply grabs a reference to the page
and then stuffs it into the pipe buffer, where it sits until it is
read.

IOWs, the splice-to-pipe operation is taking a reference to the page
cache page, then we drop all the locks that protect it and return to
userspace, which then triggers an invalidation of the page with a
hole punch, and the data disappears from the page that is referenced
in the pipe.

So copy_page_to_iter_pipe() is simply broken. It's making the
assumption that it can just take a reference to a page cache page and
the state/contents of the page will not change until the page is
released (i.e. consumed by a pipe reader). This is not true for
file-backed pages - if the page is not locked, then anything can
happen to it while it is sitting in the pipe...

Why this requires large folios to trigger is something I don't
understand - the code looks broken even for single-page objects in
the page cache. It's probably just a timing issue that high-order
folios tickle much more easily with different readahead and
invalidation patterns.

Anyway, this needs iov_iter/folio expertise at this point...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
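A hypothetical single-process variant of the same reproducer, based on
the mechanism described above: it splices file pages into a private
pipe, punches out the spliced range while those pages are still only
referenced by the pipe, then reads the pipe back and counts zeroed
bytes. This is not one of the attached test programs; the file should
be full of 0xff and evicted from the page cache first (e.g. prepared
with the xfs_io one-liner quoted above), and whether zeroed bytes
actually appear depends on readahead and large folios as discussed:

/*
 * Hedged sketch: splice into a private pipe, punch the hole, then
 * check what the pipe reader actually sees.  Illustrative only.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file-full-of-0xff>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	int pfd[2];
	if (pipe(pfd) < 0) {
		perror("pipe");
		return 1;
	}

	off_t consumed = 0;
	long long zeroed = 0, total = 0;

	for (;;) {
		/* splice up to one pipe-full of file pages (no data copy) */
		ssize_t n = splice(fd, NULL, pfd[1], NULL, 1 << 16, 0);
		if (n < 0) {
			perror("splice");
			return 1;
		}
		if (n == 0)
			break;
		consumed += n;

		/* punch out everything spliced so far, before reading the pipe */
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      0, consumed) < 0) {
			perror("fallocate");
			return 1;
		}

		/* drain the pipe and check whether the 0xff data survived */
		char buf[4096];
		for (ssize_t left = n; left > 0; ) {
			ssize_t r = read(pfd[0], buf, sizeof(buf));
			if (r <= 0)
				break;
			for (ssize_t i = 0; i < r; i++)
				if (buf[i] == 0)
					zeroed++;
			total += r;
			left -= r;
		}
	}

	printf("read %lld bytes back from the pipe, %lld were zero\n",
	       total, zeroed);
	return zeroed ? 1 : 0;
}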