Re: [PATCH v2 2/9] btrfs: fix double accounting race when extent_writepage_io() failed

Qu Wenruo <quwenruo.btrfs@xxxxxxx> · Fri, 10 Jan 2025 07:04:49 +1030

在 2025/1/10 04:36, Boris Burkov 写道:
On Thu, Jan 09, 2025 at 02:15:06PM +1030, Qu Wenruo wrote:


在 2025/1/9 08:54, Boris Burkov 写道:
On Thu, Dec 12, 2024 at 04:43:56PM +1030, Qu Wenruo wrote:
[BUG]
If submit_one_sector() failed inside extent_writepage_io() for sector
size < page size cases (e.g. 4K sector size and 64K page size), then
we can hit double ordered extent accounting error.

This should be very rare, as submit_one_sector() only fails when we
failed to grab the extent map, and such extent map should exist inside
the memory and have been pinned.

[CAUSE]
For example we have the following folio layout:

      0  4K          32K    48K   60K 64K
      |//|           |//////|     |///|

Where |///| is the dirty range we need to writeback. The 3 different
dirty ranges are submitted for regular COW.

Now we hit the following sequence:

- submit_one_sector() returned 0 for [0, 4K)

- submit_one_sector() returned 0 for [32K, 48K)

- submit_one_sector() returned error for [60K, 64K)

- btrfs_mark_ordered_io_finished() called for the whole folio
    This will mark the following ranges as finished:
    * [0, 4K)
    * [32K, 48K)
      Both ranges have their IO already submitted, this cleanup will
      lead to double accounting.

    * [60K, 64K)
      That's the correct cleanup.

The only good news is, this error is only theoretical, as the target
extent map is always pinned, thus we should directly grab it from
memory, other than reading it from the disk.

[FIX]
Instead of calling btrfs_mark_ordered_io_finished() for the whole folio
range, which can touch ranges we should not touch, instead
move the error handling inside extent_writepage_io().

So that we can cleanup exact sectors that are ought to be submitted but
failed.

This provide much more accurate cleanup, avoiding the double accounting.

Analysis and fix both make sense to me. However, this one feels a lot
more fragile than the other one.

It relies on submit_one_sector being the only error path in
extent_writepage_io. Any future error in the loop would have to create a
shared "per sector" error handling goto in the loop I guess?

Not a hard "no", in the sense that I think the code is correct for now
(aside from my submit_one_bio question) but curious if we can give this
some more principled structure.

Thanks,
Boris


Cc: stable@xxxxxxxxxxxxxxx # 5.15+
Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
---
   fs/btrfs/extent_io.c | 32 +++++++++++++++++++-------------
   1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 417c710c55ca..b6a4f1765b4c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1418,6 +1418,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
   	struct btrfs_fs_info *fs_info = inode->root->fs_info;
   	unsigned long range_bitmap = 0;
   	bool submitted_io = false;
+	bool error = false;
   	const u64 folio_start = folio_pos(folio);
   	u64 cur;
   	int bit;
@@ -1460,11 +1461,21 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
   			break;
   		}
   		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
-		if (ret < 0)
-			goto out;
+		if (unlikely(ret < 0)) {
+			submit_one_bio(bio_ctrl);

This submit_one_bio is confusing to me. submit_one_sector failed and the
subsequent comment says "there is no bio submitted" yet right here we
call submit_one_bio.

What is the meaning of it?

+			/*
+			 * Failed to grab the extent map which should be very rare.
+			 * Since there is no bio submitted to finish the ordered
+			 * extent, we have to manually finish this sector.
+			 */
+			btrfs_mark_ordered_io_finished(inode, folio, cur,
+					fs_info->sectorsize, false);
+			error = true;
+			continue;
+		}
   		submitted_io = true;
   	}
-out:
+
   	/*
   	 * If we didn't submitted any sector (>= i_size), folio dirty get
   	 * cleared but PAGECACHE_TAG_DIRTY is not cleared (only cleared
@@ -1472,8 +1483,11 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
   	 *
   	 * Here we set writeback and clear for the range. If the full folio
   	 * is no longer dirty then we clear the PAGECACHE_TAG_DIRTY tag.
+	 *
+	 * If we hit any error, the corresponding sector will still be dirty
+	 * thus no need to clear PAGECACHE_TAG_DIRTY.
   	 */

submitted_io is only used for this bit of logic, so you could consider
changing this logic by keeping a single variable for whether or not we
should go into this logic (naming it seems kind of annoying) and then
setting it in both the error and submitted_io paths. I think that
reduces headache in thinking about boolean logic, slightly.

Unfortunately I can not find a good alternative to this double boolean
usages.

I can go a single boolean, but it will be called something like
@no_error_nor_submission.

Which is the not only the worst naming, but also a hell of boolean
operations for a single bool.

I think you could do something like:

needs_reset_writeback = false;

Unfortunately, that will not work if setting it to false.

We have to set it default to true, and only set it to false in the error
or submission path.

This also means, we need to explain why we need to set the bool to false
in both paths (aka, duplicated comments)

then set it to true in either case, whether you submit an io or hit an
error.

It's your call, though, I won't be upset if you leave it as is.

I'm afraid I'll leave it as is for now.

And hope in the future we can remove the @error bool by removing the the
extent map related error path at least.

Thanks,
Qu



So I'm afraid the @error and @submitted_io will still be better for this
case.

The other comments will be addressed properly.

Thanks,
Qu

-	if (!submitted_io) {
+	if (!submitted_io && !error) {
   		btrfs_folio_set_writeback(fs_info, folio, start, len);
   		btrfs_folio_clear_writeback(fs_info, folio, start, len);
   	}
@@ -1493,7 +1507,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
   {
   	struct inode *inode = folio->mapping->host;
   	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
-	const u64 page_start = folio_pos(folio);
   	int ret;
   	size_t pg_offset;
   	loff_t i_size = i_size_read(inode);
@@ -1536,10 +1549,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl

   	bio_ctrl->wbc->nr_to_write--;

-	if (ret)
-		btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
-					       page_start, PAGE_SIZE, !ret);
-
   done:
   	if (ret < 0)
   		mapping_set_error(folio->mapping, ret);
@@ -2319,11 +2328,8 @@ void extent_write_locked_range(struct inode *inode, const struct folio *locked_f
   		if (ret == 1)
   			goto next_page;

-		if (ret) {
-			btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
-						       cur, cur_len, !ret);
+		if (ret)
   			mapping_set_error(mapping, ret);
-		}
   		btrfs_folio_end_lock(fs_info, folio, cur, cur_len);
   		if (ret < 0)
   			found_error = true;
--
2.47.1