Re: [PATCH] btrfs/080: fix sporadic failures starting with kernel 6.13

Filipe Manana <fdmanana@xxxxxxxxxx> · Fri, 28 Feb 2025 21:46:27 +0000

On Fri, Feb 28, 2025 at 9:16 PM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
>
>
>
> 在 2025/3/1 02:57, fdmanana@xxxxxxxxxx 写道:
> > From: Filipe Manana <fdmanana@xxxxxxxx>
> >
> > Since kernel 6.13, namely since commit c87c299776e4 ("btrfs: make buffered
> > write to copy one page a time"), the test can sporadically fail with an
> > unexpected digests for files in snapshots. This is because after that
> > commit the pages for a buffered write are prepared and dirtied one by one,
> > and no longer in batches up to 512 pages (on x86_64)
>
> Minor nitpicks, the original limit of pages is 8:
>
>         nrptrs = max(nrptrs, 8);
>         pages = kmalloc_array(nrptrs, sizeof(struct page *),
>                                GFP_KERNEL);

No, it can be more than 8, much more. You're missing previous code
there, as nrptrs could be greater than 8 before the max expression:

nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE), PAGE_SIZE /
(sizeof(struct page *)));
nrptrs = min(nrptrs, current->nr_dirtied_pause - current->nr_dirtied);
nrptrs = max(nrptrs, 8);
pages = kmalloc_array(nrptrs, sizeof(struct page *), GFP_KERNEL);

For example, for a 2M write, the first assignment to nrptrs results in 512.

>
> I'm not sure if it's a coincident or not, but the first 32K writes
> matches the pages limit perfectly (for x86_64).
>
> Meanwhile the second 32770 write is possible to be split even with the
> older code, but it should be super rare to hit.
>
> >, so a snapshot that
> > is created in the middle of a buffered write can result in a version of
> > the file where only a part of a buffered write got caught, for example if
> > a snapshot is created while doing the 32K write for file range [0, 32K),
> > we can end up a file in the snapshot where only the first 8K (2 pages) of
> > the write are visible, since the remaining 24K were not yet processed by
> > the write task. Before that kernel commit, this didn't happen since we
> > processed all the pages in a single batch and while holding the whole
> > range locked in the inode's io tree.
>
> This means no matter the buffered io buffer size, it's the filesystems'
> behavior determining if such a buffered write will be split.
>
> Maybe it's not that a huge deal, as the original behavior will also
> break the buffered io block size, just with a larger value (8 pages
> other than 1).
>
> >
> > This is easy to observe by adding an "od -A d -t x1" call to the test
> > before the _fail statement when we get an unexpected file digest:
> >
> >    $ ./check btrfs/080
> >    FSTYP         -- btrfs
> >    PLATFORM      -- Linux/x86_64 debian0 6.14.0-rc4-btrfs-next-188+ #1 SMP PREEMPT_DYNAMIC Wed Feb 26 17:38:41 WET 2025
> >    MKFS_OPTIONS  -- /dev/sdc
> >    MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
> >
> >    btrfs/080 32s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/080.out.bad)
> >        --- tests/btrfs/080.out        2020-06-10 19:29:03.814519074 +0100
> >        +++ /home/fdmanana/git/hub/xfstests/results//btrfs/080.out.bad 2025-02-27 17:12:08.410696958 +0000
> >        @@ -1,2 +1,6 @@
> >         QA output created by 080
> >        -Silence is golden
> >        +0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
> >        +*
> >        +0008192
> >        +Unexpected digest for file /home/fdmanana/btrfs-tests/scratch_1/17_11_56_146646257_snap/foobar_50
> >        +(see /home/fdmanana/git/hub/xfstests/results//btrfs/080.full for details)
> >        ...
> >        (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/080.out /home/fdmanana/git/hub/xfstests/results//btrfs/080.out.bad'  to see the entire diff)
> >    Ran: btrfs/080
> >    Failures: btrfs/080
> >    Failed 1 of 1 tests
> >
> > The files are created like this while snapshots are created in parallel:
> >
> >      run_check $XFS_IO_PROG -f \
> >          -c "pwrite -S 0xaa -b 32K 0 32K" \
> >          -c "fsync" \
> >          -c "pwrite -S 0xbb -b 32770 16K 32770" \
> >          -c "truncate 90123" \
> >          $SCRATCH_MNT/$name
> >
> > So with the kernel behaviour before 6.13 we expected the snapshot to
> > contain any of the following versions of the file:
> >
> > 1) Empty file;
> >
> > 2) 32K file reflecting only the first buffered write;
> >
> > 3) A file with a size of 49154 bytes reflecting both buffered writes;
> >
> > 4) A file with a size of 90123 bytes, reflecting the truncate operation
> >     and both buffered writes.
> >
> > So now the test can fail since kernel 6.13 due to snapshots catching
> > partial writes.
> >
> > However the goal of the test when I wrote it was to verify that if the
> > snapshot version of a file gets the truncated size, then the buffered
> > writes are visible in the file, since they happened before the truncate
> > operation - that is, we can't get a file with a size of 90123 bytes that
> > doesn't have the range [0, 16K) full of bytes with a value of 0xaa and
> > the range [16K, 49154) full of bytes with the 0xbb value.
> >
> > So update the test to the new kernel behaviour by verifying only that if
> > the file has the size we truncated to, then it has all the data we wrote
> > previously with the buffered writes.
> >
> > Reported-by: Glass Su <glass.su@xxxxxxxx>
> > Link: https://lore.kernel.org/linux-btrfs/30FD234D-58FC-4F3C-A947-47A5B6972C01@xxxxxxxx/
> > Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>
>
> Reviewed-by: Qu Wenruo <wqu@xxxxxxxx>
>
> Thanks for the detailed analyze and the fix.
>
> Thanks,
> Qu
>
> > ---
> >   tests/btrfs/080 | 42 +++++++++++++++++++++++-------------------
> >   1 file changed, 23 insertions(+), 19 deletions(-)
> >
> > diff --git a/tests/btrfs/080 b/tests/btrfs/080
> > index ea9d09b0..aa97d3f6 100755
> > --- a/tests/btrfs/080
> > +++ b/tests/btrfs/080
> > @@ -32,6 +32,8 @@ _cleanup()
> >
> >   _require_scratch_nocheck
> >
> > +truncate_offset=90123
> > +
> >   create_snapshot()
> >   {
> >       local ts=`date +'%H_%M_%S_%N'`
> > @@ -48,7 +50,7 @@ create_file()
> >               -c "pwrite -S 0xaa -b 32K 0 32K" \
> >               -c "fsync" \
> >               -c "pwrite -S 0xbb -b 32770 16K 32770" \
> > -             -c "truncate 90123" \
> > +             -c "truncate $truncate_offset" \
> >               $SCRATCH_MNT/$name
> >   }
> >
> > @@ -81,6 +83,12 @@ _scratch_mkfs "$mkfs_options" >>$seqres.full 2>&1
> >
> >   _scratch_mount
> >
> > +# Create a file while no snapshotting is in progress so that we get the expected
> > +# digest for every file in a snapshot that caught the truncate operation (which
> > +# sets the file's size to $truncate_offset).
> > +create_file "gold_file"
> > +expected_digest=`_md5_checksum "$SCRATCH_MNT/gold_file"`
> > +
> >   # Run some background load in order to make the issue easier to trigger.
> >   # Specially needed when testing with non-debug kernels and there isn't
> >   # any other significant load on the test machine other than this test.
> > @@ -103,24 +111,20 @@ for ((i = 0; i < $num_procs; i++)); do
> >   done
> >
> >   for f in $(find $SCRATCH_MNT -type f -name 'foobar_*'); do
> > -     digest=`md5sum $f | cut -d ' ' -f 1`
> > -     case $digest in
> > -     "d41d8cd98f00b204e9800998ecf8427e")
> > -             # ok, empty file
> > -             ;;
> > -     "c28418534a020122aca59fd3ff9581b5")
> > -             # ok, only first write captured
> > -             ;;
> > -     "cd0032da89254cdc498fda396e6a9b54")
> > -             # ok, only 2 first writes captured
> > -             ;;
> > -     "a1963f914baf4d2579d643425f4e54bc")
> > -             # ok, the 2 writes and the truncate were captured
> > -             ;;
> > -     *)
> > -             # not ok, truncate captured but not one or both writes
> > -             _fail "Unexpected digest for file $f"
> > -     esac
> > +     file_size=$(stat -c%s "$f")
> > +     # We want to verify that if the file has the size set by the truncate
> > +     # operation, then both delalloc writes were flushed, and every version
> > +     # of the file in each snapshot has its range [0, 16K) full of bytes with
> > +     # a value of 0xaa and the range [16K, 49154) is full of bytes with a
> > +     # value of 0xbb.
> > +     if [ "$file_size" -eq "$truncate_offset" ]; then
> > +             digest=`_md5_checksum "$f"`
> > +             if [ "$digest" != "$expected_digest" ]; then
> > +                     echo -e "Unexpected content for file $f:\n"
> > +                     _hexdump "$f"
> > +                     _fail "Bad file content"
> > +             fi
> > +     fi
> >   done
> >
> >   # Check the filesystem for inconsistencies.
>
>