On Tue, Jun 27, 2023 at 03:47:57PM +1000, Dave Chinner wrote:
> On Mon, Jun 26, 2023 at 09:12:52PM -0400, Matt Whitlock wrote:
> > Hello, all. I am experiencing a data corruption issue on Linux
> > 6.1.24 when calling fallocate with FALLOC_FL_PUNCH_HOLE to punch
> > out pages that have just been spliced into a pipe. It appears that
> > the fallocate call can zero out the pages that are sitting in the
> > pipe buffer, before those pages are read from the pipe.
> >
> > Simplified code excerpt (eliding error checking):
> >
> > int fd = /* open file descriptor referring to some disk file */;
> > for (off_t consumed = 0;;) {
> >     ssize_t n = splice(fd, NULL, STDOUT_FILENO, NULL, SIZE_MAX, 0);
> >     if (n <= 0) break;
> >     consumed += n;
> >     fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, consumed);
> > }
>
> Huh. Never seen that pattern before - what are you trying to
> implement with this?
>
> > Expected behavior:
> > Punching holes in a file after splicing pages out of that file
> > into a pipe should not corrupt the spliced-out pages in the pipe
> > buffer.
>
> splice is a nasty, tricky beast that should never have been
> inflicted on the world...
>
> > Observed behavior:
> > Some of the pages that have been spliced into the pipe get zeroed
> > out by the subsequent fallocate call before they can be consumed
> > from the read side of the pipe.
>
> Which implies the splice is not copying the page cache pages but
> simply taking a reference to them.
>
> > Steps to reproduce:
> >
> > 1. Save the attached ones.c, dontneed.c, and consume.c.
> >
> > 2. gcc -o ones ones.c
> >    gcc -o dontneed dontneed.c
> >    gcc -o consume consume.c
> >
> > 3. Fill a file with 32 MiB of 0xFF:
> >    ./ones | head -c$((1<<25)) >testfile
> >
> > 4. Evict the pages of the file from the page cache:
> >    sync testfile && ./dontneed testfile
>
> To save everyone some time, this one-liner:
>
> # xfs_io -ft -c "pwrite -S 0xff 0 32M" -c fsync -c "fadvise -d 0 32M" testfile
>
> does the same thing as steps 3 and 4. I also reproduced it with much
> smaller files - 1MB is large enough to see multiple corruption
> events every time I've run it.
>
> > 5. Splice the file into a pipe, punching out batches of pages
> >    after splicing them:
> >    ./consume testfile | hexdump -C
> >
> > The expected output from hexdump should show 32 MiB of 0xFF.
> > Indeed, on my system, if I omit the POSIX_FADV_DONTNEED advice,
> > then I do get the expected output. However, if the pages of the
> > file are not already present in the page cache (i.e., if the
> > splice call faults them in from disk), then the hexdump output
> > shows some pages full of 0xFF and some pages full of 0x00.
>
> Like so:
> 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 01b0a000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
> *
> 01b10000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 01b12000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
> *
> 01b18000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 01b19000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
> *
> 01b20000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 01b22000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
> *
> 01b24000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 01b25000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
> *
> 01b28000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 01b29000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
> *
> 01b2c000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>
> Hmmm. The corruption, more often than not, starts on a high-order
> aligned file offset. Tracing indicates data is being populated in
> the page cache by readahead, which would be using high-order folios
> in XFS.
>
> All the splice operations are returning byte counts that are 4kB
> aligned, so the punch is doing filesystem block aligned punches. The
> extent freeing traces indicate the filesystem is removing exactly
> the right ranges from the file, and so the page cache invalidation
> calls it is doing are also going to be for the correct ranges.
>
> This smells of a partial high-order folio invalidation problem,
> or at least a problem with splice working on pages rather than
> folios, with the two not being properly coherent as a result of
> partial folio invalidation.
>
> To confirm, I removed all the mapping_set_large_folios() calls in
> XFS, and the data corruption goes away. Hence, at minimum, large
> folios look like a trigger for the problem.
>
> Willy, over to you.
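A minimal sketch of the consume.c-style loop, reconstructed from the
simplified excerpt quoted above with error checking added. The attached
consume.c is the authoritative reproducer; the open() flags and the
error handling here are assumptions, and it is intended to be run as
"./consume testfile | hexdump -C" per step 5:

/*
 * Hedged sketch of the splice-then-punch reproducer loop described
 * above.  Not the original consume.c; names and error handling are
 * illustrative.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (off_t consumed = 0;;) {
		/* move file pages into the stdout pipe without copying */
		ssize_t n = splice(fd, NULL, STDOUT_FILENO, NULL, SIZE_MAX, 0);
		if (n < 0) {
			perror("splice");
			return 1;
		}
		if (n == 0)
			break;
		consumed += n;

		/* punch out everything spliced so far, keeping the file size */
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      0, consumed) < 0) {
			perror("fallocate");
			return 1;
		}
	}
	return 0;
}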
Just to follow up: splice ends up in the iov_iter code with ITER_PIPE
as the destination. We then end up in copy_page_to_iter(), which, if
the iter is a pipe, hands off to copy_page_to_iter_pipe(). This does
not copy the page contents; it simply grabs a reference to the page
and then stuffs it into the pipe buffer, where it sits until it is
read.

IOWs, the splice-to-pipe operation is taking a reference to the page
cache page, then we drop all the locks that protect it and return to
userspace, which then triggers an invalidation of the page with a
hole punch, and the data disappears from the page that is referenced
in the pipe.

So copy_page_to_iter_pipe() is simply broken. It's making the
assumption that it can just take a reference to a page cache page and
the state/contents of the page will not change until the page is
released (i.e. consumed by a pipe reader). This is not true for
file-backed pages - if the page is not locked, then anything can
happen to it while it is sitting in the pipe...

Why this requires large folios to trigger is something I don't
understand - the code looks broken even for single-page objects in
the page cache. It's probably just a timing issue that high-order
folios tickle much more easily with different readahead and
invalidation patterns.

Anyway, this needs iov_iter/folio expertise at this point...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
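A hypothetical single-process variant of the same reproducer, based on
the mechanism described above: it splices file pages into a private
pipe, punches out the spliced range while those pages are still only
referenced by the pipe, then reads the pipe back and counts zeroed
bytes. This is not one of the attached test programs; the file should
be full of 0xff and evicted from the page cache first (e.g. prepared
with the xfs_io one-liner quoted above), and whether zeroed bytes
actually appear depends on readahead and large folios as discussed:

/*
 * Hedged sketch: splice into a private pipe, punch the hole, then
 * check what the pipe reader actually sees.  Illustrative only.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file-full-of-0xff>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	int pfd[2];
	if (pipe(pfd) < 0) {
		perror("pipe");
		return 1;
	}

	off_t consumed = 0;
	long long zeroed = 0, total = 0;

	for (;;) {
		/* splice up to one pipe-full of file pages (no data copy) */
		ssize_t n = splice(fd, NULL, pfd[1], NULL, 1 << 16, 0);
		if (n < 0) {
			perror("splice");
			return 1;
		}
		if (n == 0)
			break;
		consumed += n;

		/* punch out everything spliced so far, before reading the pipe */
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      0, consumed) < 0) {
			perror("fallocate");
			return 1;
		}

		/* drain the pipe and check whether the 0xff data survived */
		char buf[4096];
		for (ssize_t left = n; left > 0; ) {
			ssize_t r = read(pfd[0], buf, sizeof(buf));
			if (r <= 0)
				break;
			for (ssize_t i = 0; i < r; i++)
				if (buf[i] == 0)
					zeroed++;
			total += r;
			left -= r;
		}
	}

	printf("read %lld bytes back from the pipe, %lld were zero\n",
	       total, zeroed);
	return zeroed ? 1 : 0;
}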