On Fri, Feb 28, 2025 at 9:16 PM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote: > > > > 在 2025/3/1 02:57, fdmanana@xxxxxxxxxx 写道: > > From: Filipe Manana <fdmanana@xxxxxxxx> > > > > Since kernel 6.13, namely since commit c87c299776e4 ("btrfs: make buffered > > write to copy one page a time"), the test can sporadically fail with an > > unexpected digests for files in snapshots. This is because after that > > commit the pages for a buffered write are prepared and dirtied one by one, > > and no longer in batches up to 512 pages (on x86_64) > > Minor nitpicks, the original limit of pages is 8: > > nrptrs = max(nrptrs, 8); > pages = kmalloc_array(nrptrs, sizeof(struct page *), > GFP_KERNEL); No, it can be more than 8, much more. You're missing previous code there, as nrptrs could be greater than 8 before the max expression: nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE), PAGE_SIZE / (sizeof(struct page *))); nrptrs = min(nrptrs, current->nr_dirtied_pause - current->nr_dirtied); nrptrs = max(nrptrs, 8); pages = kmalloc_array(nrptrs, sizeof(struct page *), GFP_KERNEL); For example, for a 2M write, the first assignment to nrptrs results in 512. > > I'm not sure if it's a coincident or not, but the first 32K writes > matches the pages limit perfectly (for x86_64). > > Meanwhile the second 32770 write is possible to be split even with the > older code, but it should be super rare to hit. > > >, so a snapshot that > > is created in the middle of a buffered write can result in a version of > > the file where only a part of a buffered write got caught, for example if > > a snapshot is created while doing the 32K write for file range [0, 32K), > > we can end up a file in the snapshot where only the first 8K (2 pages) of > > the write are visible, since the remaining 24K were not yet processed by > > the write task. Before that kernel commit, this didn't happen since we > > processed all the pages in a single batch and while holding the whole > > range locked in the inode's io tree. > > This means no matter the buffered io buffer size, it's the filesystems' > behavior determining if such a buffered write will be split. > > Maybe it's not that a huge deal, as the original behavior will also > break the buffered io block size, just with a larger value (8 pages > other than 1). > > > > > This is easy to observe by adding an "od -A d -t x1" call to the test > > before the _fail statement when we get an unexpected file digest: > > > > $ ./check btrfs/080 > > FSTYP -- btrfs > > PLATFORM -- Linux/x86_64 debian0 6.14.0-rc4-btrfs-next-188+ #1 SMP PREEMPT_DYNAMIC Wed Feb 26 17:38:41 WET 2025 > > MKFS_OPTIONS -- /dev/sdc > > MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1 > > > > btrfs/080 32s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/080.out.bad) > > --- tests/btrfs/080.out 2020-06-10 19:29:03.814519074 +0100 > > +++ /home/fdmanana/git/hub/xfstests/results//btrfs/080.out.bad 2025-02-27 17:12:08.410696958 +0000 > > @@ -1,2 +1,6 @@ > > QA output created by 080 > > -Silence is golden > > +0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa > > +* > > +0008192 > > +Unexpected digest for file /home/fdmanana/btrfs-tests/scratch_1/17_11_56_146646257_snap/foobar_50 > > +(see /home/fdmanana/git/hub/xfstests/results//btrfs/080.full for details) > > ... > > (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/080.out /home/fdmanana/git/hub/xfstests/results//btrfs/080.out.bad' to see the entire diff) > > Ran: btrfs/080 > > Failures: btrfs/080 > > Failed 1 of 1 tests > > > > The files are created like this while snapshots are created in parallel: > > > > run_check $XFS_IO_PROG -f \ > > -c "pwrite -S 0xaa -b 32K 0 32K" \ > > -c "fsync" \ > > -c "pwrite -S 0xbb -b 32770 16K 32770" \ > > -c "truncate 90123" \ > > $SCRATCH_MNT/$name > > > > So with the kernel behaviour before 6.13 we expected the snapshot to > > contain any of the following versions of the file: > > > > 1) Empty file; > > > > 2) 32K file reflecting only the first buffered write; > > > > 3) A file with a size of 49154 bytes reflecting both buffered writes; > > > > 4) A file with a size of 90123 bytes, reflecting the truncate operation > > and both buffered writes. > > > > So now the test can fail since kernel 6.13 due to snapshots catching > > partial writes. > > > > However the goal of the test when I wrote it was to verify that if the > > snapshot version of a file gets the truncated size, then the buffered > > writes are visible in the file, since they happened before the truncate > > operation - that is, we can't get a file with a size of 90123 bytes that > > doesn't have the range [0, 16K) full of bytes with a value of 0xaa and > > the range [16K, 49154) full of bytes with the 0xbb value. > > > > So update the test to the new kernel behaviour by verifying only that if > > the file has the size we truncated to, then it has all the data we wrote > > previously with the buffered writes. > > > > Reported-by: Glass Su <glass.su@xxxxxxxx> > > Link: https://lore.kernel.org/linux-btrfs/30FD234D-58FC-4F3C-A947-47A5B6972C01@xxxxxxxx/ > > Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx> > > Reviewed-by: Qu Wenruo <wqu@xxxxxxxx> > > Thanks for the detailed analyze and the fix. > > Thanks, > Qu > > > --- > > tests/btrfs/080 | 42 +++++++++++++++++++++++------------------- > > 1 file changed, 23 insertions(+), 19 deletions(-) > > > > diff --git a/tests/btrfs/080 b/tests/btrfs/080 > > index ea9d09b0..aa97d3f6 100755 > > --- a/tests/btrfs/080 > > +++ b/tests/btrfs/080 > > @@ -32,6 +32,8 @@ _cleanup() > > > > _require_scratch_nocheck > > > > +truncate_offset=90123 > > + > > create_snapshot() > > { > > local ts=`date +'%H_%M_%S_%N'` > > @@ -48,7 +50,7 @@ create_file() > > -c "pwrite -S 0xaa -b 32K 0 32K" \ > > -c "fsync" \ > > -c "pwrite -S 0xbb -b 32770 16K 32770" \ > > - -c "truncate 90123" \ > > + -c "truncate $truncate_offset" \ > > $SCRATCH_MNT/$name > > } > > > > @@ -81,6 +83,12 @@ _scratch_mkfs "$mkfs_options" >>$seqres.full 2>&1 > > > > _scratch_mount > > > > +# Create a file while no snapshotting is in progress so that we get the expected > > +# digest for every file in a snapshot that caught the truncate operation (which > > +# sets the file's size to $truncate_offset). > > +create_file "gold_file" > > +expected_digest=`_md5_checksum "$SCRATCH_MNT/gold_file"` > > + > > # Run some background load in order to make the issue easier to trigger. > > # Specially needed when testing with non-debug kernels and there isn't > > # any other significant load on the test machine other than this test. > > @@ -103,24 +111,20 @@ for ((i = 0; i < $num_procs; i++)); do > > done > > > > for f in $(find $SCRATCH_MNT -type f -name 'foobar_*'); do > > - digest=`md5sum $f | cut -d ' ' -f 1` > > - case $digest in > > - "d41d8cd98f00b204e9800998ecf8427e") > > - # ok, empty file > > - ;; > > - "c28418534a020122aca59fd3ff9581b5") > > - # ok, only first write captured > > - ;; > > - "cd0032da89254cdc498fda396e6a9b54") > > - # ok, only 2 first writes captured > > - ;; > > - "a1963f914baf4d2579d643425f4e54bc") > > - # ok, the 2 writes and the truncate were captured > > - ;; > > - *) > > - # not ok, truncate captured but not one or both writes > > - _fail "Unexpected digest for file $f" > > - esac > > + file_size=$(stat -c%s "$f") > > + # We want to verify that if the file has the size set by the truncate > > + # operation, then both delalloc writes were flushed, and every version > > + # of the file in each snapshot has its range [0, 16K) full of bytes with > > + # a value of 0xaa and the range [16K, 49154) is full of bytes with a > > + # value of 0xbb. > > + if [ "$file_size" -eq "$truncate_offset" ]; then > > + digest=`_md5_checksum "$f"` > > + if [ "$digest" != "$expected_digest" ]; then > > + echo -e "Unexpected content for file $f:\n" > > + _hexdump "$f" > > + _fail "Bad file content" > > + fi > > + fi > > done > > > > # Check the filesystem for inconsistencies. > >