[PATCH 1/8] btrfs: fix double accounting race when btrfs_run_delalloc_range() failed

Qu Wenruo <wqu@xxxxxxxx> · Sun, 8 Dec 2024 13:20:58 +1030

[BUG]
There are several double accounting case, where the WARN_ON_ONCE() is
triggered inside can_finish_ordered_extent().

And all such cases points back to the btrfs_mark_ordered_io_finished()
call inside extent_writepage() when it hits some error.

[CAUSE]
With extra debug patches to show where the error is from, it turns out
to be btrfs_run_delalloc_range() can fail with -ENOSPC.

Such failure itself is already a symptom of some bad data/metadata space
reservation, but here we need to focus on the error handling part.

For example, we have the following dirty page layout (4K sector size and
4K page size):

    0                       16K                     32K
    |/////|/////|/////|/////|/////|/////|/////|/////|

Where the range [0, 32K) is dirty and we need to write all the 8 pages
back.

When handling the first page 0, we go the following sequence:

- btrfs_run_delalloc_range() for range [0, 32k)
  We enter cow_file_range() for [0, 32K)

- btrfs_reserve_extent() only returned a 16K data extent.
  This can be caused by fragmentation, and it's already an indication
  we're almost running of space.

  Now we have the following layout:

    0                       16K                     32K
    |<----- Reserved ------>|/////|/////|/////|/////|

  The range [0, 16K) has ordered extent allocated.

- btrfs_reserve_extent() returned -ENOSPC
  We really run out of space. But since we have reserved space
  for range [0, 16K) we need to clean them up.

  But that cleanup for ordered extent only happens inside
  btrfs_run_delalloc_range().

- btrfs_run_delalloc_range() cleanup the reserved ordered extent
  By calling btrfs_mark_ordered_io_finished() for range [0, 32K).

  It will locate the ordered extent [0, 16K) and mark it as IOERR.
  Also since the ordered extent is only 16K, we're finishing the whole
  ordered extent.

  Thus we call btrfs_queue_ordered_fn() to queue to finish the ordered
  extent.
  But still, the ordered extent [0, 16K) is still in the
  btrfs_inode::ordered_tree.

- extent_writepage() cleanup the ordered extent inside the folio
  We call btrfs_mark_ordered_io_finished() for range [0, 4K).

  Since the finished ordered extent [0, 16K) is not yet removed (racy,
  depends on when btrfs_finish_one_ordered() is called), if
  btrfs_mark_ordered_io_finished() is called before
  btrfs_finish_one_ordered(), we will double account and trigger the
  warning inside can_finish_ordered_extent().

So the root cause is, we're relying on btrfs_mark_ordered_io_finished()
to handle ranges which is already cleaned up.

Unfortunately the bug dates back to the early days when
btrfs_mark_ordered_io_finished() is introduced as a no-brain choice for
error paths, but such no-brain solution just hides all the race and make
us less cautious when handling errors.

[FIX]
Instead of relying on the btrfs_mark_ordered_io_finished() call to
cleanup the whole folio range, record the last successfully ran delalloc
range.

And combined with bio_ctrl->submit_bitmap to properly clean up any newly
created ordered extents.

Since we have cleaned up the ordered extents in range, we should not
rely on the btrfs_mark_ordered_io_finished() inside extent_writepage()
anymore.

By this, we ensure btrfs_mark_ordered_io_finished() is only called once
when writepage_delalloc() failed.

Cc: stable@xxxxxxxxxxxxxxx # 5.15+
Fixes: e65f152e4348 ("btrfs: refactor how we finish ordered extent io for endio functions")
Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
---
 fs/btrfs/extent_io.c | 37 ++++++++++++++++++++++++++++++++-----
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9725ff7f274d..417c710c55ca 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1167,6 +1167,12 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 	 * last delalloc end.
 	 */
 	u64 last_delalloc_end = 0;
+	/*
+	 * Save the last successfully ran delalloc range end (exclusive).
+	 * This is for error handling to avoid ranges with ordered extent created
+	 * but no IO will be submitted due to error.
+	 */
+	u64 last_finished = page_start;
 	u64 delalloc_start = page_start;
 	u64 delalloc_end = page_end;
 	u64 delalloc_to_write = 0;
@@ -1235,11 +1241,19 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 			found_len = last_delalloc_end + 1 - found_start;
 
 		if (ret >= 0) {
+			/*
+			 * Some delalloc range may be created by previous folios.
+			 * Thus we still need to clean those range up during error
+			 * handling.
+			 */
+			last_finished = found_start;
 			/* No errors hit so far, run the current delalloc range. */
 			ret = btrfs_run_delalloc_range(inode, folio,
 						       found_start,
 						       found_start + found_len - 1,
 						       wbc);
+			if (ret >= 0)
+				last_finished = found_start + found_len;
 		} else {
 			/*
 			 * We've hit an error during previous delalloc range,
@@ -1274,8 +1288,21 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 
 		delalloc_start = found_start + found_len;
 	}
-	if (ret < 0)
+	/*
+	 * It's possible we have some ordered extents created before we hit
+	 * an error, cleanup non-async successfully created delalloc ranges.
+	 */
+	if (unlikely(ret < 0)) {
+		unsigned int bitmap_size = min(
+			(last_finished - page_start) >> fs_info->sectorsize_bits,
+			fs_info->sectors_per_page);
+
+		for_each_set_bit(bit, &bio_ctrl->submit_bitmap, bitmap_size)
+			btrfs_mark_ordered_io_finished(inode, folio,
+				page_start + (bit << fs_info->sectorsize_bits),
+				fs_info->sectorsize, false);
 		return ret;
+	}
 out:
 	if (last_delalloc_end)
 		delalloc_end = last_delalloc_end;
@@ -1509,13 +1536,13 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
 
 	bio_ctrl->wbc->nr_to_write--;
 
-done:
-	if (ret) {
+	if (ret)
 		btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
 					       page_start, PAGE_SIZE, !ret);
-		mapping_set_error(folio->mapping, ret);
-	}
 
+done:
+	if (ret < 0)
+		mapping_set_error(folio->mapping, ret);
 	/*
 	 * Only unlock ranges that are submitted. As there can be some async
 	 * submitted ranges inside the folio.
-- 
2.47.1