Re: [PATCH] btrfs/080: fix sporadic failures starting with kernel 6.13

On 2025/3/1 08:16, Filipe Manana wrote:
On Fri, Feb 28, 2025 at 9:16 PM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:



On 2025/3/1 02:57, fdmanana@xxxxxxxxxx wrote:
From: Filipe Manana <fdmanana@xxxxxxxx>

Since kernel 6.13, namely since commit c87c299776e4 ("btrfs: make buffered
write to copy one page a time"), the test can sporadically fail with
unexpected digests for files in snapshots. This is because after that
commit the pages for a buffered write are prepared and dirtied one by one,
and no longer in batches of up to 512 pages (on x86_64)

Minor nitpick: the original page limit is 8:

         nrptrs = max(nrptrs, 8);
         pages = kmalloc_array(nrptrs, sizeof(struct page *),
                                GFP_KERNEL);

No, it can be more than 8, much more. You're missing the previous code
there, as nrptrs could be greater than 8 before the max expression:

nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
             PAGE_SIZE / (sizeof(struct page *)));
nrptrs = min(nrptrs, current->nr_dirtied_pause - current->nr_dirtied);
nrptrs = max(nrptrs, 8);
pages = kmalloc_array(nrptrs, sizeof(struct page *), GFP_KERNEL);

For example, for a 2M write, the first assignment to nrptrs results in 512.
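That clamping sequence can be mirrored in shell arithmetic as a rough sanity check. In the sketch below, nr_dirtied_pause and nr_dirtied are hypothetical stand-ins for the per-task writeback fields, chosen large enough not to clamp:

```shell
#!/bin/sh
# Sketch of the old nrptrs clamping for a 2M buffered write, assuming
# x86_64: 4K pages and 8-byte "struct page *" pointers. nr_dirtied_pause
# and nr_dirtied are hypothetical stand-ins for the per-task fields.
page_size=4096
ptr_size=8
write_size=$((2 * 1024 * 1024))
nr_dirtied_pause=4096
nr_dirtied=0

# nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
#              PAGE_SIZE / sizeof(struct page *))
nrptrs=$(( (write_size + page_size - 1) / page_size ))
limit=$(( page_size / ptr_size ))
[ "$nrptrs" -gt "$limit" ] && nrptrs=$limit

# nrptrs = min(nrptrs, current->nr_dirtied_pause - current->nr_dirtied)
room=$(( nr_dirtied_pause - nr_dirtied ))
[ "$nrptrs" -gt "$room" ] && nrptrs=$room

# nrptrs = max(nrptrs, 8)
[ "$nrptrs" -lt 8 ] && nrptrs=8

echo "$nrptrs"   # 512 for a 2M write on x86_64
```

With these values, the 2M write yields 512 page pointers, so the whole write could be copied and dirtied as one batch before commit c87c299776e4.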

My bad, it's max() not min(), thus it's indeed way larger.

And that commit indeed caused a huge behavior change, ignoring the buffered write block size.

Thanks,
Qu




I'm not sure if it's a coincidence or not, but the first 32K writes
match the page limit perfectly (for x86_64).

Meanwhile the second 32770-byte write could be split even with the
older code, but it should be super rare to hit.
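A quick page-span calculation (a sketch, assuming 4K pages) shows why: the 32770-byte write at offset 16K touches 9 pages, one more than the 8-page floor from max(nrptrs, 8), so with the old code it could only split when writeback throttling shrank the batch below 9 pages:

```shell
#!/bin/sh
# Page span of the second write: 32770 bytes at offset 16K, assuming
# 4K pages.
page_size=4096
start=$((16 * 1024))
len=32770

first_page=$(( start / page_size ))
last_page=$(( (start + len - 1) / page_size ))
nr_pages=$(( last_page - first_page + 1 ))

echo "$nr_pages"   # 9: one more page than the max(nrptrs, 8) floor
```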

, so a snapshot that
is created in the middle of a buffered write can result in a version of
the file where only a part of the buffered write got caught, for example if
a snapshot is created while doing the 32K write for file range [0, 32K),
we can end up with a file in the snapshot where only the first 8K (2 pages)
of the write are visible, since the remaining 24K were not yet processed by
the write task. Before that kernel commit, this didn't happen since we
processed all the pages in a single batch and while holding the whole
range locked in the inode's io tree.

This means that no matter the buffered I/O buffer size, it's the
filesystem's behavior that determines whether such a buffered write will
be split.

Maybe it's not that big a deal, as the original behavior also breaks the
buffered I/O block size, just with a larger value (8 pages instead of 1).


This is easy to observe by adding an "od -A d -t x1" call to the test
before the _fail statement when we get an unexpected file digest:

    $ ./check btrfs/080
    FSTYP         -- btrfs
    PLATFORM      -- Linux/x86_64 debian0 6.14.0-rc4-btrfs-next-188+ #1 SMP PREEMPT_DYNAMIC Wed Feb 26 17:38:41 WET 2025
    MKFS_OPTIONS  -- /dev/sdc
    MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

    btrfs/080 32s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/080.out.bad)
        --- tests/btrfs/080.out        2020-06-10 19:29:03.814519074 +0100
        +++ /home/fdmanana/git/hub/xfstests/results//btrfs/080.out.bad 2025-02-27 17:12:08.410696958 +0000
        @@ -1,2 +1,6 @@
         QA output created by 080
        -Silence is golden
        +0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        +*
        +0008192
        +Unexpected digest for file /home/fdmanana/btrfs-tests/scratch_1/17_11_56_146646257_snap/foobar_50
        +(see /home/fdmanana/git/hub/xfstests/results//btrfs/080.full for details)
        ...
        (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/080.out /home/fdmanana/git/hub/xfstests/results//btrfs/080.out.bad'  to see the entire diff)
    Ran: btrfs/080
    Failures: btrfs/080
    Failed 1 of 1 tests

The files are created like this while snapshots are created in parallel:

      run_check $XFS_IO_PROG -f \
          -c "pwrite -S 0xaa -b 32K 0 32K" \
          -c "fsync" \
          -c "pwrite -S 0xbb -b 32770 16K 32770" \
          -c "truncate 90123" \
          $SCRATCH_MNT/$name

So with the kernel behaviour before 6.13 we expected the snapshot to
contain any of the following versions of the file:

1) Empty file;

2) 32K file reflecting only the first buffered write;

3) A file with a size of 49154 bytes reflecting both buffered writes;

4) A file with a size of 90123 bytes, reflecting the truncate operation
     and both buffered writes.
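Those four stable versions can be rebuilt outside of btrfs with plain tools, since the file content is fully deterministic; the sketch below reconstructs each one (0xaa is octal 252 and 0xbb is octal 273 in tr notation) and the digests it prints should line up with the four the old test hardcoded:

```shell
#!/bin/sh
# Rebuild the four acceptable versions of the file with plain tools
# (no btrfs needed) and print their md5 digests.
tmp=$(mktemp -d)

# 1) Empty file.
: > "$tmp/v1"

# 2) Only the first write captured: 32K of 0xaa bytes.
head -c 32768 /dev/zero | tr '\0' '\252' > "$tmp/v2"

# 3) Both writes captured: [0, 16K) is 0xaa, [16K, 49154) is 0xbb.
{
	head -c 16384 /dev/zero | tr '\0' '\252'
	head -c 32770 /dev/zero | tr '\0' '\273'
} > "$tmp/v3"

# 4) Both writes plus the truncate: zero-filled from 49154 up to 90123.
cp "$tmp/v3" "$tmp/v4"
truncate -s 90123 "$tmp/v4"

v1_md5=$(md5sum "$tmp/v1" | cut -d ' ' -f 1)
v3_size=$(wc -c < "$tmp/v3")
v4_size=$(wc -c < "$tmp/v4")
md5sum "$tmp/v1" "$tmp/v2" "$tmp/v3" "$tmp/v4"
```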

So now the test can fail since kernel 6.13 due to snapshots catching
partial writes.

However the goal of the test when I wrote it was to verify that if the
snapshot version of a file gets the truncated size, then the buffered
writes are visible in the file, since they happened before the truncate
operation - that is, we can't get a file with a size of 90123 bytes that
doesn't have the range [0, 16K) full of bytes with a value of 0xaa and
the range [16K, 49154) full of bytes with the 0xbb value.

So update the test to the new kernel behaviour by verifying only that if
the file has the size we truncated to, then it has all the data we wrote
previously with the buffered writes.

Reported-by: Glass Su <glass.su@xxxxxxxx>
Link: https://lore.kernel.org/linux-btrfs/30FD234D-58FC-4F3C-A947-47A5B6972C01@xxxxxxxx/
Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>

Reviewed-by: Qu Wenruo <wqu@xxxxxxxx>

Thanks for the detailed analysis and the fix.

Thanks,
Qu

---
   tests/btrfs/080 | 42 +++++++++++++++++++++++-------------------
   1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/tests/btrfs/080 b/tests/btrfs/080
index ea9d09b0..aa97d3f6 100755
--- a/tests/btrfs/080
+++ b/tests/btrfs/080
@@ -32,6 +32,8 @@ _cleanup()

   _require_scratch_nocheck

+truncate_offset=90123
+
   create_snapshot()
   {
       local ts=`date +'%H_%M_%S_%N'`
@@ -48,7 +50,7 @@ create_file()
               -c "pwrite -S 0xaa -b 32K 0 32K" \
               -c "fsync" \
               -c "pwrite -S 0xbb -b 32770 16K 32770" \
-             -c "truncate 90123" \
+             -c "truncate $truncate_offset" \
               $SCRATCH_MNT/$name
   }

@@ -81,6 +83,12 @@ _scratch_mkfs "$mkfs_options" >>$seqres.full 2>&1

   _scratch_mount

+# Create a file while no snapshotting is in progress so that we get the expected
+# digest for every file in a snapshot that caught the truncate operation (which
+# sets the file's size to $truncate_offset).
+create_file "gold_file"
+expected_digest=`_md5_checksum "$SCRATCH_MNT/gold_file"`
+
   # Run some background load in order to make the issue easier to trigger.
   # Specially needed when testing with non-debug kernels and there isn't
   # any other significant load on the test machine other than this test.
@@ -103,24 +111,20 @@ for ((i = 0; i < $num_procs; i++)); do
   done

   for f in $(find $SCRATCH_MNT -type f -name 'foobar_*'); do
-     digest=`md5sum $f | cut -d ' ' -f 1`
-     case $digest in
-     "d41d8cd98f00b204e9800998ecf8427e")
-             # ok, empty file
-             ;;
-     "c28418534a020122aca59fd3ff9581b5")
-             # ok, only first write captured
-             ;;
-     "cd0032da89254cdc498fda396e6a9b54")
-             # ok, only 2 first writes captured
-             ;;
-     "a1963f914baf4d2579d643425f4e54bc")
-             # ok, the 2 writes and the truncate were captured
-             ;;
-     *)
-             # not ok, truncate captured but not one or both writes
-             _fail "Unexpected digest for file $f"
-     esac
+     file_size=$(stat -c%s "$f")
+     # We want to verify that if the file has the size set by the truncate
+     # operation, then both delalloc writes were flushed, and every version
+     # of the file in each snapshot has its range [0, 16K) full of bytes with
+     # a value of 0xaa and the range [16K, 49154) is full of bytes with a
+     # value of 0xbb.
+     if [ "$file_size" -eq "$truncate_offset" ]; then
+             digest=`_md5_checksum "$f"`
+             if [ "$digest" != "$expected_digest" ]; then
+                     echo -e "Unexpected content for file $f:\n"
+                     _hexdump "$f"
+                     _fail "Bad file content"
+             fi
+     fi
   done

   # Check the filesystem for inconsistencies.







