Re: [PATCH v2 2/9] btrfs: fix double accounting race when extent_writepage_io() failed

Boris Burkov <boris@xxxxxx> · Thu, 9 Jan 2025 10:06:24 -0800

On Thu, Jan 09, 2025 at 02:15:06PM +1030, Qu Wenruo wrote:
> 
> 
> 在 2025/1/9 08:54, Boris Burkov 写道:
> > On Thu, Dec 12, 2024 at 04:43:56PM +1030, Qu Wenruo wrote:
> > > [BUG]
> > > If submit_one_sector() failed inside extent_writepage_io() for sector
> > > size < page size cases (e.g. 4K sector size and 64K page size), then
> > > we can hit double ordered extent accounting error.
> > > 
> > > This should be very rare, as submit_one_sector() only fails when we
> > > failed to grab the extent map, and such extent map should exist inside
> > > the memory and have been pinned.
> > > 
> > > [CAUSE]
> > > For example we have the following folio layout:
> > > 
> > >      0  4K          32K    48K   60K 64K
> > >      |//|           |//////|     |///|
> > > 
> > > Where |///| is the dirty range we need to writeback. The 3 different
> > > dirty ranges are submitted for regular COW.
> > > 
> > > Now we hit the following sequence:
> > > 
> > > - submit_one_sector() returned 0 for [0, 4K)
> > > 
> > > - submit_one_sector() returned 0 for [32K, 48K)
> > > 
> > > - submit_one_sector() returned error for [60K, 64K)
> > > 
> > > - btrfs_mark_ordered_io_finished() called for the whole folio
> > >    This will mark the following ranges as finished:
> > >    * [0, 4K)
> > >    * [32K, 48K)
> > >      Both ranges have their IO already submitted, this cleanup will
> > >      lead to double accounting.
> > > 
> > >    * [60K, 64K)
> > >      That's the correct cleanup.
> > > 
> > > The only good news is, this error is only theoretical, as the target
> > > extent map is always pinned, thus we should directly grab it from
> > > memory, other than reading it from the disk.
> > > 
> > > [FIX]
> > > Instead of calling btrfs_mark_ordered_io_finished() for the whole folio
> > > range, which can touch ranges we should not touch, instead
> > > move the error handling inside extent_writepage_io().
> > > 
> > > So that we can cleanup exact sectors that are ought to be submitted but
> > > failed.
> > > 
> > > This provide much more accurate cleanup, avoiding the double accounting.
> > 
> > Analysis and fix both make sense to me. However, this one feels a lot
> > more fragile than the other one.
> > 
> > It relies on submit_one_sector being the only error path in
> > extent_writepage_io. Any future error in the loop would have to create a
> > shared "per sector" error handling goto in the loop I guess?
> > 
> > Not a hard "no", in the sense that I think the code is correct for now
> > (aside from my submit_one_bio question) but curious if we can give this
> > some more principled structure.
> > 
> > Thanks,
> > Boris
> > 
> > > 
> > > Cc: stable@xxxxxxxxxxxxxxx # 5.15+
> > > Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
> > > ---
> > >   fs/btrfs/extent_io.c | 32 +++++++++++++++++++-------------
> > >   1 file changed, 19 insertions(+), 13 deletions(-)
> > > 
> > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > > index 417c710c55ca..b6a4f1765b4c 100644
> > > --- a/fs/btrfs/extent_io.c
> > > +++ b/fs/btrfs/extent_io.c
> > > @@ -1418,6 +1418,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
> > >   	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> > >   	unsigned long range_bitmap = 0;
> > >   	bool submitted_io = false;
> > > +	bool error = false;
> > >   	const u64 folio_start = folio_pos(folio);
> > >   	u64 cur;
> > >   	int bit;
> > > @@ -1460,11 +1461,21 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
> > >   			break;
> > >   		}
> > >   		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
> > > -		if (ret < 0)
> > > -			goto out;
> > > +		if (unlikely(ret < 0)) {
> > > +			submit_one_bio(bio_ctrl);
> > 
> > This submit_one_bio is confusing to me. submit_one_sector failed and the
> > subsequent comment says "there is no bio submitted" yet right here we
> > call submit_one_bio.
> > 
> > What is the meaning of it?
> > 
> > > +			/*
> > > +			 * Failed to grab the extent map which should be very rare.
> > > +			 * Since there is no bio submitted to finish the ordered
> > > +			 * extent, we have to manually finish this sector.
> > > +			 */
> > > +			btrfs_mark_ordered_io_finished(inode, folio, cur,
> > > +					fs_info->sectorsize, false);
> > > +			error = true;
> > > +			continue;
> > > +		}
> > >   		submitted_io = true;
> > >   	}
> > > -out:
> > > +
> > >   	/*
> > >   	 * If we didn't submitted any sector (>= i_size), folio dirty get
> > >   	 * cleared but PAGECACHE_TAG_DIRTY is not cleared (only cleared
> > > @@ -1472,8 +1483,11 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
> > >   	 *
> > >   	 * Here we set writeback and clear for the range. If the full folio
> > >   	 * is no longer dirty then we clear the PAGECACHE_TAG_DIRTY tag.
> > > +	 *
> > > +	 * If we hit any error, the corresponding sector will still be dirty
> > > +	 * thus no need to clear PAGECACHE_TAG_DIRTY.
> > >   	 */
> > 
> > submitted_io is only used for this bit of logic, so you could consider
> > changing this logic by keeping a single variable for whether or not we
> > should go into this logic (naming it seems kind of annoying) and then
> > setting it in both the error and submitted_io paths. I think that
> > reduces headache in thinking about boolean logic, slightly.
> 
> Unfortunately I can not find a good alternative to this double boolean
> usages.
> 
> I can go a single boolean, but it will be called something like
> @no_error_nor_submission.
> 
> Which is the not only the worst naming, but also a hell of boolean
> operations for a single bool.

I think you could do something like:

needs_reset_writeback = false;
then set it to true in either case, whether you submit an io or hit an
error.

It's your call, though, I won't be upset if you leave it as is.

> 
> So I'm afraid the @error and @submitted_io will still be better for this
> case.
> 
> The other comments will be addressed properly.
> 
> Thanks,
> Qu
> > 
> > > -	if (!submitted_io) {
> > > +	if (!submitted_io && !error) {
> > >   		btrfs_folio_set_writeback(fs_info, folio, start, len);
> > >   		btrfs_folio_clear_writeback(fs_info, folio, start, len);
> > >   	}
> > > @@ -1493,7 +1507,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
> > >   {
> > >   	struct inode *inode = folio->mapping->host;
> > >   	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> > > -	const u64 page_start = folio_pos(folio);
> > >   	int ret;
> > >   	size_t pg_offset;
> > >   	loff_t i_size = i_size_read(inode);
> > > @@ -1536,10 +1549,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
> > > 
> > >   	bio_ctrl->wbc->nr_to_write--;
> > > 
> > > -	if (ret)
> > > -		btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
> > > -					       page_start, PAGE_SIZE, !ret);
> > > -
> > >   done:
> > >   	if (ret < 0)
> > >   		mapping_set_error(folio->mapping, ret);
> > > @@ -2319,11 +2328,8 @@ void extent_write_locked_range(struct inode *inode, const struct folio *locked_f
> > >   		if (ret == 1)
> > >   			goto next_page;
> > > 
> > > -		if (ret) {
> > > -			btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
> > > -						       cur, cur_len, !ret);
> > > +		if (ret)
> > >   			mapping_set_error(mapping, ret);
> > > -		}
> > >   		btrfs_folio_end_lock(fs_info, folio, cur, cur_len);
> > >   		if (ret < 0)
> > >   			found_error = true;
> > > --
> > > 2.47.1
> > > 
> > 
>