Re: After block device error, FICLONE and sync_file_range() make NULs, unlike read()

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Wed, 9 Nov 2022 08:47:15 -0800

On Tue, Nov 08, 2022 at 09:24:36AM -0800, Noah Misch wrote:
> Scenario: due to a block device error, the kernel fails to persist some file
> content.  Even so, read() always returns the file content accurately.

...reads of the source file?  Or the destination file?  Your script
doesn't explicitly sum the source file after the EIO, so I'm guessing
the “cat $broken_source | cat > $dest; sum $dest” demonstrates that the
page cache of the source file is still ok?

> The first FICLONE returns EIO, but every subsequent FICLONE or
> copy_file_range() operates as though the file were all zeros.

Note FICLONE != clone_file_range.  FICLONE flushes dirty data to disk
and reflinks blocks; clone_file_range has a multitude of behaviors
depending on fs.  On XFS, it first will try FICLONE before falling back
to pipe copies.

So the first thing is to figure out where the EIO comes from such that
FICLONE fails.  Kernel logs would help here, since a dirty data
writeback encountering a broken disk will cause FICLONE to fail with EIO
but leave the fs running.

The second thing would be to step through the system behavior call by
call -- what is the state of $dest after the FICLONE call but before the
c_f_r.  Is it an empty file with size zero?  Is the pagecache for the
source file still intact?  That will tell us if FICLONE is somehow
stamping out files full of zeroes, or if this is a weird c_f_r behavior.

I wrote the FICLONE behavior in XFS, and AFAIK there isn't a vector by
which the $dest file would end up being 16MB all zeroed (but hey, let's
simulate this behavior and check anyway).  However, c_f_r is ... a mess,
and I wouldn't be surprised if it did something whacky like that.

> How
> feasible is it change FICLONE
> and copy_file_range() such that they instead find the bytes that read() finds?
> 
> - Kernel is 6.0.0-1-sparc64-smp from Debian sid, running in a Solaris-hosted VM.
> 
> - The VM is gcc202 from https://cfarm.tetaneutral.net/machines/list/.
>   Accounts are available.
> 
> - The outcome is still reproducible in FICLONE issued two days after the

Are you issuing raw FICLONE calls, or cp with reflink again?

>   original block device error.  I haven't checked whether it survives a
>   reboot.
> 
> - The "sync" command did not help.
> 
> - The block device errors have been ongoing for years.  If curious, see
>   https://postgr.es/m/CA+hUKGKfrXnuyk0Z24m8x4_eziuC3kLSaCmEeKPO1DVU9t-qtQ@xxxxxxxxxxxxxx
>   for details.  (Fixing the sunvdc driver is out of scope for this thread.)
>   Other known symptoms are failures in truncate() and fsync().  The system has
>   been generally usable for applications not requiring persistence.  I saw the

<cough> Well I guess if you're going to tie my hands from the start...

>   FICLONE problem after the system updated coreutils from 8.32-4.1 to 9.1-1.
>   That introduced a "cp" that uses FICLONE.  My current workaround is to place
>   a "cp" in my PATH that does 'exec /usr/bin/cp --reflink=never "$@"'
> 
> 
> The trouble emerged at a "cp".  To capture more details, I replaced "cp" with
> "trace-cp" containing:
> 
>   sum "$1"
>   strace cp "$@" 2>&1 | sed -n '/^geteuid/,$p'
>   sum "$2"
> 
> Output from that follows.  FICLONE returns EIO.  "cp" then falls back to
> copy_file_range(), which yields an all-zeros file:
> 
>   47831 16384 pg_wal/000000030000000000000003
>   geteuid()                               = 1450
>   openat(AT_FDCWD, "/home/nm/src/pg/backbranch/extra/src/test/recovery/tmp_check/t_028_pitr_timelines_primary_data/archives/000000030000000000000003", O_RDONLY|O_PATH|O_DIRECTORY) = -1 ENOENT (No such file or directory)
>   fstatat64(AT_FDCWD, "pg_wal/000000030000000000000003", {st_mode=S_IFREG|0600, st_size=16777216, ...}, 0) = 0
>   openat(AT_FDCWD, "pg_wal/000000030000000000000003", O_RDONLY) = 4
>   fstatat64(4, "", {st_mode=S_IFREG|0600, st_size=16777216, ...}, AT_EMPTY_PATH) = 0
>   openat(AT_FDCWD, "/home/nm/src/pg/backbranch/extra/src/test/recovery/tmp_check/t_028_pitr_timelines_primary_data/archives/000000030000000000000003", O_WRONLY|O_CREAT|O_EXCL, 0600) = 5
>   ioctl(5, BTRFS_IOC_CLONE or FICLONE, 4) = -1 EIO (Input/output error)
>   fstatat64(5, "", {st_mode=S_IFREG|0600, st_size=0, ...}, AT_EMPTY_PATH) = 0
>   fadvise64_64(4, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
>   copy_file_range(4, NULL, 5, NULL, 9223372035781033984, 0) = 16777216
>   copy_file_range(4, NULL, 5, NULL, 9223372035781033984, 0) = 0

Clearly c_f_r thought it had 16MB of *something* to copy here.  It would
be interesting to ftrace the xfs reflink calls to find out if it called
FICLONE a second time, or if it actually tried a pagecache copy.

>   close(5)                                = 0
>   close(4)                                = 0
>   _llseek(0, 0, [0], SEEK_CUR)            = 0
>   close(0)                                = 0
>   close(1)                                = 0
>   close(2)                                = 0
>   exit_group(0)                           = ?
>   +++ exited with 0 +++
>   00000 16384 /home/nm/src/pg/backbranch/extra/src/test/recovery/tmp_check/t_028_pitr_timelines_primary_data/archives/000000030000000000000003
> 
> Subsequent FICLONE returns 0 and yields an all-zeros file.  Test script:
> 
>   set -x
>   broken_source=t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
>   dest=$HOME/tmp/discard
>   sum "$broken_source"
>   : 'FICLONE returns 0 and yields an all-zeros file'
>   strace cp --reflink=always "$broken_source" "$dest" 2>&1 | sed -n '/^geteuid/,$p'
>   sum "$dest"; rm "$dest"
>   : 'copy_file_range() returns 0 and yields an all-zeros file'
>   strace -e copy_file_range cat "$broken_source" >"$dest"
>   sum "$dest"; rm "$dest"
>   : 'read() gets the intended bytes'
>   cat "$broken_source" | cat >"$dest"
>   sum "$dest"; rm "$dest"
> 
> Test script output:
> 
>   + broken_source=t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
>   + dest=/home/nm/tmp/discard
>   + sum t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
>   49522 16384 t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
>   + : FICLONE returns 0 and yields an all-zeros file
>   + strace cp --reflink=always t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003 /home/nm/tmp/discard
>   + sed -n /^geteuid/,$p
>   geteuid()                               = 1450
>   openat(AT_FDCWD, "/home/nm/tmp/discard", O_RDONLY|O_PATH|O_DIRECTORY) = -1 ENOENT (No such file or directory)
>   fstatat64(AT_FDCWD, "t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003", {st_mode=S_IFREG|0600, st_size=16777216, ...}, 0) = 0
>   openat(AT_FDCWD, "t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003", O_RDONLY) = 3
>   fstatat64(3, "", {st_mode=S_IFREG|0600, st_size=16777216, ...}, AT_EMPTY_PATH) = 0
>   openat(AT_FDCWD, "/home/nm/tmp/discard", O_WRONLY|O_CREAT|O_EXCL, 0600) = 4
>   ioctl(4, BTRFS_IOC_CLONE or FICLONE, 3) = 0
>   close(4)                                = 0
>   close(3)                                = 0
>   _llseek(0, 0, 0x7feffddf1c0, SEEK_CUR)  = -1 ESPIPE (Illegal seek)
>   close(0)                                = 0
>   close(1)                                = 0
>   close(2)                                = 0
>   exit_group(0)                           = ?
>   +++ exited with 0 +++
>   + sum /home/nm/tmp/discard
>   00000 16384 /home/nm/tmp/discard

Curious.  This time the FICLONE actually succeeded.  Can you “filefrag
-v $dest” here and show us if the dest file has space mapped to that 16M
or if it's sparse?

Waitaminute.  About that 16M of data that's in $broken_source -- was all
of that written to the pagecache right before the first call to cp?  I
have a theory here that if you did:

	write(src_fd, <16M of data>) = 16M
	ficlone(dst_fd, src_fd) = -EIO
	copy_file_range(dst_fd, src_fd)

Then what's going on here is that the first write call dirties 16M of
pagecache.  The FICLONE does an implied fsync() to flush the dirty data
to disk, but that writeback fails.  The pagecache does not respond to
writeback failure by redirtying the pages.  After the failed FICLONE,
the src_fd file does not have written extents allocated to it -- either
it'll be an unwritten extent, or nothing at all.

Then the second c_f_r comes along and tries FICLONE again.  This time
the "flush" succeeds because the pages are "clean" so we go ahead with
the reflink.  (Thanks, cp, for dropping the EIO!!!)  Reflink only shares
written extents, so it extends dst_fd's size without mapping any real
space to it.  That's my theory for why you see a $dest file that's 16M
and all zeroes.

(Think of FICLONE as a low-level duplicator of disk contents whose only
interaction with the page cache is to flush it at the start.)

Then you come along and manually cat $broken_source to $dest, using that
odd pipe construction:

   cat "$broken_source" | cat >"$dest"

The introduction of the pipe means that c_f_r immediately goes to the
pipe copying fallback, which reads the pagecache to the pipe; and writes
the pipe contents to $dest.  Hence this works where everything else
fails.

So I guess the question now is, what do we do about it?  The pagecache
maintainers have never been receptive to redirtying pages after a
writeback failure; cp should really pass that EIO out to userspace
instead of silently eating it; and maaaybe FICLONE should detect EIOs
recorded in the file's pagecache and return that, but it won't fix the
underlying problem, which is that the cache thinks its clean after an
EIO, and the pagecache forgets about recorded EIOs after reporting them
via fsync/syncfs.

If you rebooted the machine at the end of the script, you'd likely see
that $broken_source is either empty or also full of zeroes.  Quite
possibly $dest would have the contents, since it did manage to get its
own copy of the "clean" pagecache data and persist it.

--D

>   + rm /home/nm/tmp/discard
>   + : copy_file_range() returns 0 and yields an all-zeros file
>   + strace -e copy_file_range cat t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
>   copy_file_range(3, NULL, 1, NULL, 9223372035781033984, 0) = 16777216
>   copy_file_range(3, NULL, 1, NULL, 9223372035781033984, 0) = 0
>   +++ exited with 0 +++
>   + sum /home/nm/tmp/discard
>   00000 16384 /home/nm/tmp/discard
>   + rm /home/nm/tmp/discard
>   + : read() gets the intended bytes
>   + cat t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
>   + cat
>   + sum /home/nm/tmp/discard
>   49522 16384 /home/nm/tmp/discard
>   + rm /home/nm/tmp/discard