Re: [PATCH] btrfs: fix data overwriting bug during buffered write when block size < page size

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Feb 18, 2025 at 11:10 PM Qu Wenruo <wqu@xxxxxxxx> wrote:
>
> [BUG]
> When running generic/417 with a btrfs whose block size < page size

generic/417 -> generic/418

> (subpage cases), it always fails.
>
> And the following minimal reproducer is more than enough to trigger it
> reliably:
>
> workload()
> {
>         mkfs.btrfs -s 4k -f $dev > /dev/null
>         dmesg -C
>         mount $dev $mnt
>         $fsstree_dir/src/dio-invalidate-cache -r -b 4096 -n 3 -i 1 -f $mnt/diotest
>         ret=$?
>         umount $mnt
>         stop_trace
>         if [ $ret -ne 0 ]; then
>                 fail
>         fi
> }
>
> for (( i = 0; i < 1024; i++)); do
>         echo "=== $i/$runtime ==="
>         workload
> done
>
> [CAUSE]
> With extra trace printk added to the following functions:
> - btrfs_buffered_write()
>   * Which folio is touched
>   * The file offset (start) where the buffered write is at
>   * How many bytes are copied
>   * The content of the write (the first 2 bytes)
>
> - submit_one_sector()
>   * Which folio is touched
>   * The position inside the folio
>
> - pagecache_isize_extended()
>   * The parameters of the function itself
>   * The parameters of the folio_zero_range()
>
> Which are enough to show the problem:
>
>   22.158114: btrfs_buffered_write: folio pos=0 start=0 copied=4096 content=0x0101
>   22.158161: submit_one_sector: r/i=5/257 folio=0 pos=0 content=0x0101
>   22.158609: btrfs_buffered_write: folio pos=0 start=4096 copied=4096 content=0x0101
>   22.158634: btrfs_buffered_write: folio pos=0 start=8192 copied=4096 content=0x0101
>   22.158650: pagecache_isize_extended: folio=0 from=4096 to=8192 bsize=4096 zero off=4096 len=8192
>   22.158682: submit_one_sector: r/i=5/257 folio=0 pos=4096 content=0x0000
>   22.158686: submit_one_sector: r/i=5/257 folio=0 pos=8192 content=0x0101
>
> The tool dio-invalidate-cache will start 3 threads, each doing a buffered
> write with 0x01 at 4096 * i (i is 0, 1 ,2), do a fsync, then do a direct read,
> and compare the read buffer with the write buffer.
>
> Note that all 3 btrfs_buffered_write() are writing the correct 0x01 into
> the page cache.
>
> But at submit_one_sector(), at file offset 4096, the content is zeroed
> out, mostly by pagecache_isize_extended().
>
> The race happens like this:
>  Thread A is writing into range [4K, 8K).
>  Thread B is writing into range [8K, 12k).
>
>                Thread A              |         Thread B
> -------------------------------------+------------------------------------
> btrfs_buffered_write()               | btrfs_buffered_write()
> |- old_isize = 4K;                   | |- old_isize = 4096;
> |- btrfs_inode_lock()                | |
> |- write into folio range [4K, 8K)   | |
> |- pagecache_isize_extended()        | |
> |  extend isize from 4096 to 8192    | |
> |  no folio_zero_range() called      | |
> |- btrfs_inode_lock()                | |
>                                      | |- btrfs_inode_lock()
>                                      | |- write into folio range [8K, 12K)
>                                      | |- pagecache_isize_extended()
>                                      | |  calling folio_zero_range(4K, 8K)
>                                      | |  This is caused by the old_isize is
>                                      | |  grabbed too early, without any
>                                      | |  inode lock.
>                                      | |- btrfs_inode_unlock()
>
> The @old_isize is grabbed without inode lock, causing race between two
> buffered write threads and making pagecache_isize_extended() to zero
> range which is still containing cached data.
>
> And this is only affecting subpage btrfs, because for regular blocksize
> == page size case, the function pagecache_isize_extended() will do
> nothing if the block size >= page size.
>
> [FIX]
> Grab the old isize with inode lock hold.

hold -> held

Or easier to understand:

Grab the old i_size while holding the inode lock.

> This means each buffered write thread will have a stable view of the
> old inode size, thus avoid the above race.
>
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>

Fixes: 5e8b9ef30392 ("btrfs: move pos increment and pagecache
extension to btrfs_buffered_write")

That is the commit that introduced the race, before it we were
grabbing i_size and doing the pagecache_isize_extended() call while
holding the inode lock.

> ---
>  fs/btrfs/file.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index fd90855fe717..896dc03689d6 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1090,7 +1090,7 @@ ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i)
>         u64 lockend;
>         size_t num_written = 0;
>         ssize_t ret;
> -       loff_t old_isize = i_size_read(inode);
> +       loff_t old_isize;
>         unsigned int ilock_flags = 0;
>         const bool nowait = (iocb->ki_flags & IOCB_NOWAIT);
>         unsigned int bdp_flags = (nowait ? BDP_ASYNC : 0);
> @@ -1103,6 +1103,13 @@ ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i)
>         if (ret < 0)
>                 return ret;
>
> +       /*
> +        * We can only trust the isize with inode lock hold, or it can race with

Same here, hold -> held

With those things addressed:

Reviewed-by: Filipe Manana <fdmanana@xxxxxxxx>

Thanks.

> +        * other buffered writes and cause incorrect call of
> +        * pagecache_isize_extended() to overwrite existing data.
> +        */
> +       old_isize = i_size_read(inode);
> +
>         ret = generic_write_checks(iocb, i);
>         if (ret <= 0)
>                 goto out;
> --
> 2.48.1
>
>





[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux