Re: [PATCH v2 2/9] btrfs: fix double accounting race when extent_writepage_io() failed

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Dec 12, 2024 at 04:43:56PM +1030, Qu Wenruo wrote:
> [BUG]
> If submit_one_sector() failed inside extent_writepage_io() for sector
> size < page size cases (e.g. 4K sector size and 64K page size), then
> we can hit double ordered extent accounting error.
> 
> This should be very rare, as submit_one_sector() only fails when we
> failed to grab the extent map, and such extent map should exist inside
> the memory and have been pinned.
> 
> [CAUSE]
> For example we have the following folio layout:
> 
>     0  4K          32K    48K   60K 64K
>     |//|           |//////|     |///|
> 
> Where |///| is the dirty range we need to writeback. The 3 different
> dirty ranges are submitted for regular COW.
> 
> Now we hit the following sequence:
> 
> - submit_one_sector() returned 0 for [0, 4K)
> 
> - submit_one_sector() returned 0 for [32K, 48K)
> 
> - submit_one_sector() returned error for [60K, 64K)
> 
> - btrfs_mark_ordered_io_finished() called for the whole folio
>   This will mark the following ranges as finished:
>   * [0, 4K)
>   * [32K, 48K)
>     Both ranges have their IO already submitted, this cleanup will
>     lead to double accounting.
> 
>   * [60K, 64K)
>     That's the correct cleanup.
> 
> The only good news is, this error is only theoretical, as the target
> extent map is always pinned, thus we should directly grab it from
> memory, other than reading it from the disk.
> 
> [FIX]
> Instead of calling btrfs_mark_ordered_io_finished() for the whole folio
> range, which can touch ranges we should not touch, instead
> move the error handling inside extent_writepage_io().
> 
> So that we can cleanup exact sectors that are ought to be submitted but
> failed.
> 
> This provide much more accurate cleanup, avoiding the double accounting.

Analysis and fix both make sense to me. However, this one feels a lot
more fragile than the other one.

It relies on submit_one_sector being the only error path in
extent_writepage_io. Any future error in the loop would have to create a
shared "per sector" error handling goto in the loop I guess?

Not a hard "no", in the sense that I think the code is correct for now
(aside from my submit_one_bio question) but curious if we can give this
some more principled structure.

Thanks,
Boris

> 
> Cc: stable@xxxxxxxxxxxxxxx # 5.15+
> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
> ---
>  fs/btrfs/extent_io.c | 32 +++++++++++++++++++-------------
>  1 file changed, 19 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 417c710c55ca..b6a4f1765b4c 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1418,6 +1418,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>  	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>  	unsigned long range_bitmap = 0;
>  	bool submitted_io = false;
> +	bool error = false;
>  	const u64 folio_start = folio_pos(folio);
>  	u64 cur;
>  	int bit;
> @@ -1460,11 +1461,21 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>  			break;
>  		}
>  		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
> -		if (ret < 0)
> -			goto out;
> +		if (unlikely(ret < 0)) {
> +			submit_one_bio(bio_ctrl);

This submit_one_bio is confusing to me. submit_one_sector failed and the
subsequent comment says "there is no bio submitted" yet right here we
call submit_one_bio.

What is the meaning of it?

> +			/*
> +			 * Failed to grab the extent map which should be very rare.
> +			 * Since there is no bio submitted to finish the ordered
> +			 * extent, we have to manually finish this sector.
> +			 */
> +			btrfs_mark_ordered_io_finished(inode, folio, cur,
> +					fs_info->sectorsize, false);
> +			error = true;
> +			continue;
> +		}
>  		submitted_io = true;
>  	}
> -out:
> +
>  	/*
>  	 * If we didn't submitted any sector (>= i_size), folio dirty get
>  	 * cleared but PAGECACHE_TAG_DIRTY is not cleared (only cleared
> @@ -1472,8 +1483,11 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>  	 *
>  	 * Here we set writeback and clear for the range. If the full folio
>  	 * is no longer dirty then we clear the PAGECACHE_TAG_DIRTY tag.
> +	 *
> +	 * If we hit any error, the corresponding sector will still be dirty
> +	 * thus no need to clear PAGECACHE_TAG_DIRTY.
>  	 */

submitted_io is only used for this bit of logic, so you could consider
changing this logic by keeping a single variable for whether or not we
should go into this logic (naming it seems kind of annoying) and then
setting it in both the error and submitted_io paths. I think that
reduces headache in thinking about boolean logic, slightly.

> -	if (!submitted_io) {
> +	if (!submitted_io && !error) {
>  		btrfs_folio_set_writeback(fs_info, folio, start, len);
>  		btrfs_folio_clear_writeback(fs_info, folio, start, len);
>  	}
> @@ -1493,7 +1507,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
>  {
>  	struct inode *inode = folio->mapping->host;
>  	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> -	const u64 page_start = folio_pos(folio);
>  	int ret;
>  	size_t pg_offset;
>  	loff_t i_size = i_size_read(inode);
> @@ -1536,10 +1549,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
>  
>  	bio_ctrl->wbc->nr_to_write--;
>  
> -	if (ret)
> -		btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
> -					       page_start, PAGE_SIZE, !ret);
> -
>  done:
>  	if (ret < 0)
>  		mapping_set_error(folio->mapping, ret);
> @@ -2319,11 +2328,8 @@ void extent_write_locked_range(struct inode *inode, const struct folio *locked_f
>  		if (ret == 1)
>  			goto next_page;
>  
> -		if (ret) {
> -			btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
> -						       cur, cur_len, !ret);
> +		if (ret)
>  			mapping_set_error(mapping, ret);
> -		}
>  		btrfs_folio_end_lock(fs_info, folio, cur, cur_len);
>  		if (ret < 0)
>  			found_error = true;
> -- 
> 2.47.1
> 




[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux