On Thu, Jan 09, 2025 at 02:15:06PM +1030, Qu Wenruo wrote: > > > 在 2025/1/9 08:54, Boris Burkov 写道: > > On Thu, Dec 12, 2024 at 04:43:56PM +1030, Qu Wenruo wrote: > > > [BUG] > > > If submit_one_sector() failed inside extent_writepage_io() for sector > > > size < page size cases (e.g. 4K sector size and 64K page size), then > > > we can hit double ordered extent accounting error. > > > > > > This should be very rare, as submit_one_sector() only fails when we > > > failed to grab the extent map, and such extent map should exist inside > > > the memory and have been pinned. > > > > > > [CAUSE] > > > For example we have the following folio layout: > > > > > > 0 4K 32K 48K 60K 64K > > > |//| |//////| |///| > > > > > > Where |///| is the dirty range we need to writeback. The 3 different > > > dirty ranges are submitted for regular COW. > > > > > > Now we hit the following sequence: > > > > > > - submit_one_sector() returned 0 for [0, 4K) > > > > > > - submit_one_sector() returned 0 for [32K, 48K) > > > > > > - submit_one_sector() returned error for [60K, 64K) > > > > > > - btrfs_mark_ordered_io_finished() called for the whole folio > > > This will mark the following ranges as finished: > > > * [0, 4K) > > > * [32K, 48K) > > > Both ranges have their IO already submitted, this cleanup will > > > lead to double accounting. > > > > > > * [60K, 64K) > > > That's the correct cleanup. > > > > > > The only good news is, this error is only theoretical, as the target > > > extent map is always pinned, thus we should directly grab it from > > > memory, other than reading it from the disk. > > > > > > [FIX] > > > Instead of calling btrfs_mark_ordered_io_finished() for the whole folio > > > range, which can touch ranges we should not touch, instead > > > move the error handling inside extent_writepage_io(). > > > > > > So that we can cleanup exact sectors that are ought to be submitted but > > > failed. > > > > > > This provide much more accurate cleanup, avoiding the double accounting. > > > > Analysis and fix both make sense to me. However, this one feels a lot > > more fragile than the other one. > > > > It relies on submit_one_sector being the only error path in > > extent_writepage_io. Any future error in the loop would have to create a > > shared "per sector" error handling goto in the loop I guess? > > > > Not a hard "no", in the sense that I think the code is correct for now > > (aside from my submit_one_bio question) but curious if we can give this > > some more principled structure. > > > > Thanks, > > Boris > > > > > > > > Cc: stable@xxxxxxxxxxxxxxx # 5.15+ > > > Signed-off-by: Qu Wenruo <wqu@xxxxxxxx> > > > --- > > > fs/btrfs/extent_io.c | 32 +++++++++++++++++++------------- > > > 1 file changed, 19 insertions(+), 13 deletions(-) > > > > > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > > > index 417c710c55ca..b6a4f1765b4c 100644 > > > --- a/fs/btrfs/extent_io.c > > > +++ b/fs/btrfs/extent_io.c > > > @@ -1418,6 +1418,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode, > > > struct btrfs_fs_info *fs_info = inode->root->fs_info; > > > unsigned long range_bitmap = 0; > > > bool submitted_io = false; > > > + bool error = false; > > > const u64 folio_start = folio_pos(folio); > > > u64 cur; > > > int bit; > > > @@ -1460,11 +1461,21 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode, > > > break; > > > } > > > ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size); > > > - if (ret < 0) > > > - goto out; > > > + if (unlikely(ret < 0)) { > > > + submit_one_bio(bio_ctrl); > > > > This submit_one_bio is confusing to me. submit_one_sector failed and the > > subsequent comment says "there is no bio submitted" yet right here we > > call submit_one_bio. > > > > What is the meaning of it? > > > > > + /* > > > + * Failed to grab the extent map which should be very rare. > > > + * Since there is no bio submitted to finish the ordered > > > + * extent, we have to manually finish this sector. > > > + */ > > > + btrfs_mark_ordered_io_finished(inode, folio, cur, > > > + fs_info->sectorsize, false); > > > + error = true; > > > + continue; > > > + } > > > submitted_io = true; > > > } > > > -out: > > > + > > > /* > > > * If we didn't submitted any sector (>= i_size), folio dirty get > > > * cleared but PAGECACHE_TAG_DIRTY is not cleared (only cleared > > > @@ -1472,8 +1483,11 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode, > > > * > > > * Here we set writeback and clear for the range. If the full folio > > > * is no longer dirty then we clear the PAGECACHE_TAG_DIRTY tag. > > > + * > > > + * If we hit any error, the corresponding sector will still be dirty > > > + * thus no need to clear PAGECACHE_TAG_DIRTY. > > > */ > > > > submitted_io is only used for this bit of logic, so you could consider > > changing this logic by keeping a single variable for whether or not we > > should go into this logic (naming it seems kind of annoying) and then > > setting it in both the error and submitted_io paths. I think that > > reduces headache in thinking about boolean logic, slightly. > > Unfortunately I can not find a good alternative to this double boolean > usages. > > I can go a single boolean, but it will be called something like > @no_error_nor_submission. > > Which is the not only the worst naming, but also a hell of boolean > operations for a single bool. I think you could do something like: needs_reset_writeback = false; then set it to true in either case, whether you submit an io or hit an error. It's your call, though, I won't be upset if you leave it as is. > > So I'm afraid the @error and @submitted_io will still be better for this > case. > > The other comments will be addressed properly. > > Thanks, > Qu > > > > > - if (!submitted_io) { > > > + if (!submitted_io && !error) { > > > btrfs_folio_set_writeback(fs_info, folio, start, len); > > > btrfs_folio_clear_writeback(fs_info, folio, start, len); > > > } > > > @@ -1493,7 +1507,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl > > > { > > > struct inode *inode = folio->mapping->host; > > > struct btrfs_fs_info *fs_info = inode_to_fs_info(inode); > > > - const u64 page_start = folio_pos(folio); > > > int ret; > > > size_t pg_offset; > > > loff_t i_size = i_size_read(inode); > > > @@ -1536,10 +1549,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl > > > > > > bio_ctrl->wbc->nr_to_write--; > > > > > > - if (ret) > > > - btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio, > > > - page_start, PAGE_SIZE, !ret); > > > - > > > done: > > > if (ret < 0) > > > mapping_set_error(folio->mapping, ret); > > > @@ -2319,11 +2328,8 @@ void extent_write_locked_range(struct inode *inode, const struct folio *locked_f > > > if (ret == 1) > > > goto next_page; > > > > > > - if (ret) { > > > - btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio, > > > - cur, cur_len, !ret); > > > + if (ret) > > > mapping_set_error(mapping, ret); > > > - } > > > btrfs_folio_end_lock(fs_info, folio, cur, cur_len); > > > if (ret < 0) > > > found_error = true; > > > -- > > > 2.47.1 > > > > > >