On Thu, Dec 12, 2024 at 04:43:56PM +1030, Qu Wenruo wrote: > [BUG] > If submit_one_sector() failed inside extent_writepage_io() for sector > size < page size cases (e.g. 4K sector size and 64K page size), then > we can hit double ordered extent accounting error. > > This should be very rare, as submit_one_sector() only fails when we > failed to grab the extent map, and such extent map should exist inside > the memory and have been pinned. > > [CAUSE] > For example we have the following folio layout: > > 0 4K 32K 48K 60K 64K > |//| |//////| |///| > > Where |///| is the dirty range we need to writeback. The 3 different > dirty ranges are submitted for regular COW. > > Now we hit the following sequence: > > - submit_one_sector() returned 0 for [0, 4K) > > - submit_one_sector() returned 0 for [32K, 48K) > > - submit_one_sector() returned error for [60K, 64K) > > - btrfs_mark_ordered_io_finished() called for the whole folio > This will mark the following ranges as finished: > * [0, 4K) > * [32K, 48K) > Both ranges have their IO already submitted, this cleanup will > lead to double accounting. > > * [60K, 64K) > That's the correct cleanup. > > The only good news is, this error is only theoretical, as the target > extent map is always pinned, thus we should directly grab it from > memory, other than reading it from the disk. > > [FIX] > Instead of calling btrfs_mark_ordered_io_finished() for the whole folio > range, which can touch ranges we should not touch, instead > move the error handling inside extent_writepage_io(). > > So that we can cleanup exact sectors that are ought to be submitted but > failed. > > This provide much more accurate cleanup, avoiding the double accounting. Analysis and fix both make sense to me. However, this one feels a lot more fragile than the other one. It relies on submit_one_sector being the only error path in extent_writepage_io. Any future error in the loop would have to create a shared "per sector" error handling goto in the loop I guess? Not a hard "no", in the sense that I think the code is correct for now (aside from my submit_one_bio question) but curious if we can give this some more principled structure. Thanks, Boris > > Cc: stable@xxxxxxxxxxxxxxx # 5.15+ > Signed-off-by: Qu Wenruo <wqu@xxxxxxxx> > --- > fs/btrfs/extent_io.c | 32 +++++++++++++++++++------------- > 1 file changed, 19 insertions(+), 13 deletions(-) > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index 417c710c55ca..b6a4f1765b4c 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -1418,6 +1418,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode, > struct btrfs_fs_info *fs_info = inode->root->fs_info; > unsigned long range_bitmap = 0; > bool submitted_io = false; > + bool error = false; > const u64 folio_start = folio_pos(folio); > u64 cur; > int bit; > @@ -1460,11 +1461,21 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode, > break; > } > ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size); > - if (ret < 0) > - goto out; > + if (unlikely(ret < 0)) { > + submit_one_bio(bio_ctrl); This submit_one_bio is confusing to me. submit_one_sector failed and the subsequent comment says "there is no bio submitted" yet right here we call submit_one_bio. What is the meaning of it? > + /* > + * Failed to grab the extent map which should be very rare. > + * Since there is no bio submitted to finish the ordered > + * extent, we have to manually finish this sector. > + */ > + btrfs_mark_ordered_io_finished(inode, folio, cur, > + fs_info->sectorsize, false); > + error = true; > + continue; > + } > submitted_io = true; > } > -out: > + > /* > * If we didn't submitted any sector (>= i_size), folio dirty get > * cleared but PAGECACHE_TAG_DIRTY is not cleared (only cleared > @@ -1472,8 +1483,11 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode, > * > * Here we set writeback and clear for the range. If the full folio > * is no longer dirty then we clear the PAGECACHE_TAG_DIRTY tag. > + * > + * If we hit any error, the corresponding sector will still be dirty > + * thus no need to clear PAGECACHE_TAG_DIRTY. > */ submitted_io is only used for this bit of logic, so you could consider changing this logic by keeping a single variable for whether or not we should go into this logic (naming it seems kind of annoying) and then setting it in both the error and submitted_io paths. I think that reduces headache in thinking about boolean logic, slightly. > - if (!submitted_io) { > + if (!submitted_io && !error) { > btrfs_folio_set_writeback(fs_info, folio, start, len); > btrfs_folio_clear_writeback(fs_info, folio, start, len); > } > @@ -1493,7 +1507,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl > { > struct inode *inode = folio->mapping->host; > struct btrfs_fs_info *fs_info = inode_to_fs_info(inode); > - const u64 page_start = folio_pos(folio); > int ret; > size_t pg_offset; > loff_t i_size = i_size_read(inode); > @@ -1536,10 +1549,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl > > bio_ctrl->wbc->nr_to_write--; > > - if (ret) > - btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio, > - page_start, PAGE_SIZE, !ret); > - > done: > if (ret < 0) > mapping_set_error(folio->mapping, ret); > @@ -2319,11 +2328,8 @@ void extent_write_locked_range(struct inode *inode, const struct folio *locked_f > if (ret == 1) > goto next_page; > > - if (ret) { > - btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio, > - cur, cur_len, !ret); > + if (ret) > mapping_set_error(mapping, ret); > - } > btrfs_folio_end_lock(fs_info, folio, cur, cur_len); > if (ret < 0) > found_error = true; > -- > 2.47.1 >