On Mon, Jun 26, 2023 at 09:12:52PM -0400, Matt Whitlock wrote:
> Hello, all. I am experiencing a data corruption issue on Linux 6.1.24
> when calling fallocate with FALLOC_FL_PUNCH_HOLE to punch out pages
> that have just been spliced into a pipe. It appears that the fallocate
> call can zero out the pages that are sitting in the pipe buffer,
> before those pages are read from the pipe.
>
> Simplified code excerpt (eliding error checking):
>
> int fd = /* open file descriptor referring to some disk file */;
> for (off_t consumed = 0;;) {
> 	ssize_t n = splice(fd, NULL, STDOUT_FILENO, NULL, SIZE_MAX, 0);
> 	if (n <= 0) break;
> 	consumed += n;
> 	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, consumed);
> }

Huh. Never seen that pattern before - what are you trying to implement
with this?

> Expected behavior:
> Punching holes in a file after splicing pages out of that file into a
> pipe should not corrupt the spliced-out pages in the pipe buffer.

splice is a nasty, tricky beast that should never have been inflicted
on the world...

> Observed behavior:
> Some of the pages that have been spliced into the pipe get zeroed out
> by the subsequent fallocate call before they can be consumed from the
> read side of the pipe.

Which implies the splice is not copying the page cache pages but
simply taking a reference to them.

> Steps to reproduce:
>
> 1. Save the attached ones.c, dontneed.c, and consume.c.
>
> 2. gcc -o ones ones.c
>    gcc -o dontneed dontneed.c
>    gcc -o consume consume.c
>
> 3. Fill a file with 32 MiB of 0xFF:
>    ./ones | head -c$((1<<25)) >testfile
>
> 4. Evict the pages of the file from the page cache:
>    sync testfile && ./dontneed testfile

To save everyone some time, this one-liner:

# xfs_io -ft -c "pwrite -S 0xff 0 32M" -c fsync -c "fadvise -d 0 32M" testfile

does the same thing as steps 3 and 4. I also reproduced it with much
smaller files - 1MB is large enough to see multiple corruption events
every time I've run it.

> 5. Splice the file into a pipe, punching out batches of pages after
>    splicing them:
>    ./consume testfile | hexdump -C
>
> The expected output from hexdump should show 32 MiB of 0xFF. Indeed,
> on my system, if I omit the POSIX_FADV_DONTNEED advice, then I do get
> the expected output. However, if the pages of the file are not already
> present in the page cache (i.e., if the splice call faults them in
> from disk), then the hexdump output shows some pages full of 0xFF and
> some pages full of 0x00.

Like so:

00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
01b0a000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
01b10000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
01b12000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
01b18000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
01b19000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
01b20000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
01b22000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
01b24000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
01b25000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
01b28000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
01b29000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
01b2c000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

Hmmm.
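For anyone following along without the attachments, this is roughly
what the consumer loop looks like once error checking is put back in.
It's a sketch based on the excerpt above, not the attached consume.c,
so the details may differ:

/*
 * Reconstruction of the consumer loop from the excerpt above, with
 * error checking added. The real consume.c is attached to the
 * original report and may differ in detail.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (off_t consumed = 0;;) {
		/* Move page cache pages from the file into the stdout pipe. */
		ssize_t n = splice(fd, NULL, STDOUT_FILENO, NULL, SIZE_MAX, 0);
		if (n < 0) {
			perror("splice");
			return 1;
		}
		if (n == 0)
			break;
		consumed += n;

		/*
		 * Punch out everything spliced so far; those pages may
		 * still be sitting unread in the pipe at this point.
		 */
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      0, consumed) < 0) {
			perror("fallocate");
			return 1;
		}
	}
	return 0;
}

The point to note is that each hole punch covers the whole range
[0, consumed) that has just been handed to the pipe, so it always
overlaps pages the reader may not have drained yet.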
The corruption, more often than not, starts on a high-order aligned
file offset. Tracing indicates data is being populated in the page
cache by readahead, which would be using high-order folios in XFS.

All the splice operations return byte counts that are 4kB aligned, so
punch is doing filesystem block aligned punches. The extent freeing
traces indicate the filesystem is removing exactly the right ranges
from the file, and so the page cache invalidation calls it is doing
are also going to be for the correct ranges.

This smells of a partial high-order folio invalidation problem, or at
least a problem with splice working on pages rather than folios and
the two not being properly coherent as a result of partial folio
invalidation.

To confirm, I removed all the mapping_set_large_folios() calls in XFS,
and the data corruption goes away. Hence, at minimum, large folios
look like a trigger for the problem.

Willy, over to you.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx