On 2024/4/8 20:18, Filipe Manana wrote:
On Sun, Apr 7, 2024 at 2:18 AM Qu Wenruo <wqu@xxxxxxxx> wrote:
[BUG]
During my extent_map cleanup/refactor, with much stricter sanity
checks added, extent-map-tests::test_case_7() would crash on those
checks.
The problem is that, after btrfs_drop_extent_map_range(), the resulting
extent_map has a @block_start that is way too large.
Meanwhile my btrfs_file_extent_item based members are returning a
correct @disk_bytenr along with correct @offset.
The extent map layout looks like this:
	0        16K        32K        48K
	| PINNED |          | Regular  |
The regular em at [32K, 48K) also has a @block_start of 32K.
Then drop range [0, 36K), which should shrink the regular one to be
[36K, 48K).
However the @block_start is incorrect, we expect 32K + 4K, but got 52K.
[CAUSE]
Inside btrfs_drop_extent_map_range(), if we hit an extent_map that
overlaps the drop range but extends beyond its end, we need to split
that extent map in two:
	|<-- drop range -->|
		 |<----- existing extent_map --->|
And if the extent map is not compressed, we need to advance
extent_map::block_start by the difference between the end of the drop
range and the extent map start.
However in that particular case, the difference is calculated using
(start + len - em->start).
The problem is @start can be modified if the drop range covers any
pinned extent.
This leads to a wrong calculation, which would be caught by my later
extent_map sanity checks, which check em::block_start against
btrfs_file_extent_item::disk_bytenr + btrfs_file_extent_item::offset.
And unfortunately this is going to cause data corruption, as the
split em points to an incorrect location, which can cause either
unexpected read errors or wild writes.
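The arithmetic behind the bad value can be reproduced with a small standalone model. This is only a sketch, not the kernel code: old_split_block_start() and its parameters are hypothetical names, and it assumes a single pinned extent map inside the drop range, located before the regular one, with @len left unchanged when @start is advanced.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the old diff calculation, not the kernel function.
 * @start/@len describe the drop range, @pinned_end is the end of a pinned
 * extent map inside that range, @em_start/@em_block_start describe the
 * regular extent map that gets split.
 */
static uint64_t old_split_block_start(uint64_t start, uint64_t len,
				      uint64_t pinned_end,
				      uint64_t em_start,
				      uint64_t em_block_start)
{
	/* The model only covers a pinned em located before the regular em. */
	assert(pinned_end > start && pinned_end <= em_start);

	/* skip_pinned == true: @start jumps past the pinned em, @len stays. */
	start = pinned_end;

	/* Old formula: start + len is no longer the end of the drop range. */
	const uint64_t diff = start + len - em_start;

	return em_block_start + diff;
}
```

With the test_case_7() values (start 0, len 36K, pinned end 16K, em start and block_start both 32K), diff becomes 16K + 36K - 32K = 20K, so the function returns 52K instead of the expected 36K.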
It can't happen for either reads or writes actually.
As for writes, it can't happen because:
1) The issue only happens when skip_pinned is true, which is the only
case that adjusts the 'start' variable (parameter);
2) All IO paths pass false for the skip_pinned parameter, only
relocation passes true when replacing the bytenr in file extent items,
and the range it uses for btrfs_drop_extent_map_range() matches the
extent item's range, so it won't cover extent maps outside the range;
Thankfully that's what I missed.
In that case we're fine.
3) Extent maps for writes in progress are always pinned;
4) Before doing IO on a range we lock the range and wait for any
existing ordered extents in the range to complete, which results in
unpinning extent maps;
5) Extent maps for writes are created when running delalloc (or during
the write for direct IO), along with the ordered extent, and are
created as pinned.
With all these, I don't see how we can get a "wild write" or any
problem in a write path.
As for reads, it doesn't happen because of what's said in 2 regarding
the range passed to btrfs_drop_extent_map_range().
So as far as I can see, it's currently a harmless bug, and maybe it
always has been because the bad calculation has been there since 2008,
see below.
If it affected reads or writes, it would be easy to trigger with
fstests and fsx for example (fstests).
It's certainly a bug, it just doesn't have any consequences as far as
I can see, so the changelog should be updated.
[FIX]
Fix it by avoiding @start completely and using @end - em->start
instead, where @end is the exclusive end of the drop range.
Also update the test case to verify @block_start, to prevent such a
problem from happening again.
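As a rough illustration of the fixed calculation, the sketch below splits the tail of a non-compressed extent map using only the exclusive end of the drop range. struct split_model and split_tail() are hypothetical names for this example; they only mirror the three fields discussed here and are not the kernel's struct extent_map.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of the split extent map. */
struct split_model {
	uint64_t start;       /* file offset of the remaining tail */
	uint64_t len;         /* length of the remaining tail */
	uint64_t block_start; /* disk bytenr of the remaining tail */
};

/*
 * Keep the part of [em_start, em_start + em_len) that lies at or after
 * @end, where @end is the exclusive end of the drop range.
 */
static struct split_model split_tail(uint64_t em_start, uint64_t em_len,
				     uint64_t em_block_start, uint64_t end)
{
	/* The drop range must end strictly inside the extent map. */
	assert(end > em_start && end < em_start + em_len);

	/* Fixed formula: independent of how often @start was advanced. */
	const uint64_t diff = end - em_start;
	const struct split_model split = {
		.start = end,
		.len = em_start + em_len - end,
		.block_start = em_block_start + diff,
	};

	return split;
}
```

For the layout above, split_tail(32K, 16K, 32K, 36K) yields start 36K, len 12K and block_start 36K, which is the 32K + 4K value the updated test case checks for.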
CC: stable@xxxxxxxxxxxxxxx # 6.7+
Fixes: c962098ca4af ("btrfs: fix incorrect splitting in btrfs_drop_extent_map_range")
That commit doesn't influence how split->block_start is updated, only
split->start and split->len.
So I can't understand why you chose to blame that commit.
That patch removed the @len update that accompanied each @start update.
Before that patch, every time we updated @start, @len would be changed
to keep the end the same.
The bug was actually introduced in 2008 by the following commit:
3b951516ed70 ("Btrfs: Use the extent map cache to find the logical
disk block during data retries")
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3b951516ed703af0f6d82053937655ad69b60864
Nope, just before the offending patch, the code looks like this for
pinned extent maps:
		if (skip_pinned && test_bit(EXTENT_FLAG_PINNED,
					    &em->flags)) {
			start = em_end;
			if (end != (u64)-1)
				len = start + len - em_end;
			goto next;
		}
Which is correct.
Thanks,
Qu
Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
---
fs/btrfs/extent_map.c | 2 +-
fs/btrfs/tests/extent-map-tests.c | 6 +++++-
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 471654cb65b0..955ce300e5a1 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -799,7 +799,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
split->block_len = em->block_len;
split->orig_start = em->orig_start;
} else {
- const u64 diff = start + len - em->start;
+ const u64 diff = end - em->start;
split->block_len = split->len;
split->block_start += diff;
diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
index 253cce7ffecf..80e71c5cb7ab 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -818,7 +818,6 @@ static int test_case_7(struct btrfs_fs_info *fs_info)
test_err("em->len is %llu, expected 16K", em->len);
goto out;
}
-
Please avoid such accidental changes.
Thanks.
free_extent_map(em);
read_lock(&em_tree->lock);
@@ -847,6 +846,11 @@ static int test_case_7(struct btrfs_fs_info *fs_info)
goto out;
}
+ if (em->block_start != SZ_32K + SZ_4K) {
+ test_err("em->block_start is %llu, expected 36K", em->block_start);
+ goto out;
+ }
+
free_extent_map(em);
read_lock(&em_tree->lock);
--
2.44.0