Patch "btrfs: fix data overwriting bug during buffered write when block size < page size" has been added to the 6.12-stable tree

Sasha Levin <sashal@xxxxxxxxxx> · Sat, 8 Mar 2025 09:09:34 -0500

This is a note to let you know that I've just added the patch titled

    btrfs: fix data overwriting bug during buffered write when block size < page size

to the 6.12-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     btrfs-fix-data-overwriting-bug-during-buffered-write.patch
and it can be found in the queue-6.12 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.



commit 6226abdef75084708ec8ed2a2af6d774f6d227b1
Author: Qu Wenruo <wqu@xxxxxxxx>
Date:   Wed Feb 19 09:06:33 2025 +1030

    btrfs: fix data overwriting bug during buffered write when block size < page size
    
    [ Upstream commit efa11fd269c139e29b71ec21bc9c9c0063fde40d ]
    
    [BUG]
    When running generic/418 with a btrfs whose block size < page size
    (subpage cases), it always fails.
    
    And the following minimal reproducer is more than enough to trigger it
    reliably:
    
    workload()
    {
            mkfs.btrfs -s 4k -f $dev > /dev/null
            dmesg -C
            mount $dev $mnt
            $fsstree_dir/src/dio-invalidate-cache -r -b 4096 -n 3 -i 1 -f $mnt/diotest
            ret=$?
            umount $mnt
            stop_trace
            if [ $ret -ne 0 ]; then
                    fail
            fi
    }
    
    for (( i = 0; i < 1024; i++)); do
            echo "=== $i/$runtime ==="
            workload
    done
    
    [CAUSE]
    With extra trace printk added to the following functions:
    - btrfs_buffered_write()
      * Which folio is touched
      * The file offset (start) where the buffered write is at
      * How many bytes are copied
      * The content of the write (the first 2 bytes)
    
    - submit_one_sector()
      * Which folio is touched
      * The position inside the folio
      * The content of the page cache (the first 2 bytes)
    
    - pagecache_isize_extended()
      * The parameters of the function itself
      * The parameters of the folio_zero_range()
    
    Which are enough to show the problem:
    
      22.158114: btrfs_buffered_write: folio pos=0 start=0 copied=4096 content=0x0101
      22.158161: submit_one_sector: r/i=5/257 folio=0 pos=0 content=0x0101
      22.158609: btrfs_buffered_write: folio pos=0 start=4096 copied=4096 content=0x0101
      22.158634: btrfs_buffered_write: folio pos=0 start=8192 copied=4096 content=0x0101
      22.158650: pagecache_isize_extended: folio=0 from=4096 to=8192 bsize=4096 zero off=4096 len=8192
      22.158682: submit_one_sector: r/i=5/257 folio=0 pos=4096 content=0x0000
      22.158686: submit_one_sector: r/i=5/257 folio=0 pos=8192 content=0x0101
    
    The tool dio-invalidate-cache will start 3 threads, each doing a buffered
    write with 0x01 at offset 0, 4096 and 8192, do a fsync, then do a direct read,
    and compare the read buffer with the write buffer.
    
    Note that all 3 btrfs_buffered_write() are writing the correct 0x01 into
    the page cache.
    
    But at submit_one_sector(), at file offset 4096, the content is zeroed
    out, by pagecache_isize_extended().
    
    The race happens like this:
     Thread A is writing into range [4K, 8K).
     Thread B is writing into range [8K, 12k).
    
                   Thread A              |         Thread B
    -------------------------------------+------------------------------------
    btrfs_buffered_write()               | btrfs_buffered_write()
    |- old_isize = 4K;                   | |- old_isize = 4096;
    |- btrfs_inode_lock()                | |
    |- write into folio range [4K, 8K)   | |
    |- pagecache_isize_extended()        | |
    |  extend isize from 4096 to 8192    | |
    |  no folio_zero_range() called      | |
    |- btrfs_inode_lock()                | |
                                         | |- btrfs_inode_lock()
                                         | |- write into folio range [8K, 12K)
                                         | |- pagecache_isize_extended()
                                         | |  calling folio_zero_range(4K, 8K)
                                         | |  This is caused by the old_isize is
                                         | |  grabbed too early, without any
                                         | |  inode lock.
                                         | |- btrfs_inode_unlock()
    
    The @old_isize is grabbed without inode lock, causing race between two
    buffered write threads and making pagecache_isize_extended() to zero
    range which is still containing cached data.
    
    And this is only affecting subpage btrfs, because for regular blocksize
    == page size case, the function pagecache_isize_extended() will do
    nothing if the block size >= page size.
    
    [FIX]
    Grab the old i_size while holding the inode lock.
    This means each buffered write thread will have a stable view of the
    old inode size, thus avoid the above race.
    
    CC: stable@xxxxxxxxxxxxxxx # 5.15+
    Fixes: 5e8b9ef30392 ("btrfs: move pos increment and pagecache extension to btrfs_buffered_write")
    Reviewed-by: Filipe Manana <fdmanana@xxxxxxxx>
    Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
    Signed-off-by: David Sterba <dsterba@xxxxxxxx>
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 848cb2c3d9dde..78c4a3765002e 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1200,7 +1200,7 @@ ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i)
 	ssize_t ret;
 	bool only_release_metadata = false;
 	bool force_page_uptodate = false;
-	loff_t old_isize = i_size_read(inode);
+	loff_t old_isize;
 	unsigned int ilock_flags = 0;
 	const bool nowait = (iocb->ki_flags & IOCB_NOWAIT);
 	unsigned int bdp_flags = (nowait ? BDP_ASYNC : 0);
@@ -1212,6 +1212,13 @@ ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i)
 	if (ret < 0)
 		return ret;
 
+	/*
+	 * We can only trust the isize with inode lock held, or it can race with
+	 * other buffered writes and cause incorrect call of
+	 * pagecache_isize_extended() to overwrite existing data.
+	 */
+	old_isize = i_size_read(inode);
+
 	ret = generic_write_checks(iocb, i);
 	if (ret <= 0)
 		goto out;