Re: [PATCH v2 5/9] btrfs: do proper folio cleanup when run_delalloc_nocow() failed

Boris Burkov <boris@xxxxxx> · Thu, 9 Jan 2025 15:26:59 -0800

On Thu, Dec 12, 2024 at 04:43:59PM +1030, Qu Wenruo wrote:
> [BUG]
> With CONFIG_DEBUG_VM set, test case generic/476 has some chance to crash
> with the following VM_BUG_ON_FOLIO():
> 
>  BTRFS error (device dm-3): cow_file_range failed, start 1146880 end 1253375 len 106496 ret -28
>  BTRFS error (device dm-3): run_delalloc_nocow failed, start 1146880 end 1253375 len 106496 ret -28
>  page: refcount:4 mapcount:0 mapping:00000000592787cc index:0x12 pfn:0x10664
>  aops:btrfs_aops [btrfs] ino:101 dentry name(?):"f1774"
>  flags: 0x2fffff80004028(uptodate|lru|private|node=0|zone=2|lastcpupid=0xfffff)
>  page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
>  ------------[ cut here ]------------
>  kernel BUG at mm/page-writeback.c:2992!
>  Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
>  CPU: 2 UID: 0 PID: 3943513 Comm: kworker/u24:15 Tainted: G           OE      6.12.0-rc7-custom+ #87
>  Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
>  Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
>  Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
>  pc : folio_clear_dirty_for_io+0x128/0x258
>  lr : folio_clear_dirty_for_io+0x128/0x258
>  Call trace:
>   folio_clear_dirty_for_io+0x128/0x258
>   btrfs_folio_clamp_clear_dirty+0x80/0xd0 [btrfs]
>   __process_folios_contig+0x154/0x268 [btrfs]
>   extent_clear_unlock_delalloc+0x5c/0x80 [btrfs]
>   run_delalloc_nocow+0x5f8/0x760 [btrfs]
>   btrfs_run_delalloc_range+0xa8/0x220 [btrfs]
>   writepage_delalloc+0x230/0x4c8 [btrfs]
>   extent_writepage+0xb8/0x358 [btrfs]
>   extent_write_cache_pages+0x21c/0x4e8 [btrfs]
>   btrfs_writepages+0x94/0x150 [btrfs]
>   do_writepages+0x74/0x190
>   filemap_fdatawrite_wbc+0x88/0xc8
>   start_delalloc_inodes+0x178/0x3a8 [btrfs]
>   btrfs_start_delalloc_roots+0x174/0x280 [btrfs]
>   shrink_delalloc+0x114/0x280 [btrfs]
>   flush_space+0x250/0x2f8 [btrfs]
>   btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
>   process_one_work+0x164/0x408
>   worker_thread+0x25c/0x388
>   kthread+0x100/0x118
>   ret_from_fork+0x10/0x20
>  Code: 910a8021 a90363f7 a9046bf9 94012379 (d4210000)
>  ---[ end trace 0000000000000000 ]---
> 
> [CAUSE]
> The first two lines of extra debug messages show the problem is caused
> by the error handling of run_delalloc_nocow().
> 
> E.g. we have the following dirtied range (4K blocksize 4K page size):
> 
>     0                 16K                  32K
>     |//////////////////////////////////////|
>     |  Pre-allocated  |
> 
> And the range [0, 16K) has a preallocated extent.
> 
> - Enter run_delalloc_nocow() for range [0, 16K)
>   Which found range [0, 16K) is preallocated, can do the proper NOCOW
>   write.
> 
> - Enter fallback_to_fow() for range [16K, 32K)
>   Since the range [16K, 32K) is not backed by preallocated extent, we
>   have to go COW.
> 
> - cow_file_range() failed for range [16K, 32K)
>   So cow_file_range() will do the clean up by clearing folio dirty,
>   unlock the folios.
> 
>   Now the folios in range [16K, 32K) is unlocked.
> 
> - Enter extent_clear_unlock_delalloc() from run_delalloc_nocow()
>   Which is called with PAGE_START_WRITEBACK to start page writeback.
>   But folios can only be marked writeback when it's properly locked,
>   thus this triggered the VM_BUG_ON_FOLIO().
> 
> Furthermore there is another hidden but common bug that
> run_delalloc_nocow() is not clearing the folio dirty flags in its error
> handling path.
> This is the common bug shared between run_delalloc_nocow() and
> cow_file_range().
> 
> [FIX]
> - Clear folio dirty for range [@start, @cur_offset)
>   Introduce a helper, cleanup_dirty_folios(), which
>   will find and lock the folio in the range, clear the dirty flag and
>   start/end the writeback, with the extra handling for the
>   @locked_folio.
> 
> - Introduce a helper to record the last failed COW range end
>   This is to trace which range we should skip, to avoid double
>   unlocking.
> 
> - Skip the failed COW range for the error handling
> 
> Cc: stable@xxxxxxxxxxxxxxx

Reviewed-by: Boris Burkov <boris@xxxxxx>
> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
> ---
>  fs/btrfs/inode.c | 93 ++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 86 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 19c88b7d0363..bae8aceb3eae 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1961,6 +1961,48 @@ static int can_nocow_file_extent(struct btrfs_path *path,
>  	return ret < 0 ? ret : can_nocow;
>  }
>  

I like this function. Can you add a simple doc with pre and post
conditions please?

> +static void cleanup_dirty_folios(struct btrfs_inode *inode,
> +				 struct folio *locked_folio,
> +				 u64 start, u64 end, int error)
> +{
> +	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +	struct address_space *mapping = inode->vfs_inode.i_mapping;
> +	pgoff_t start_index = start >> PAGE_SHIFT;
> +	pgoff_t end_index = end >> PAGE_SHIFT;
> +	u32 len;
> +
> +	ASSERT(end + 1 - start < U32_MAX);
> +	ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
> +	       IS_ALIGNED(end + 1, fs_info->sectorsize));
> +	len = end + 1 - start;
> +
> +	/*
> +	 * Handle the locked folio first.
> +	 * btrfs_folio_clamp_*() helpers can handle range out of the folio case.
> +	 */
> +	btrfs_folio_clamp_clear_dirty(fs_info, locked_folio, start, len);
> +	btrfs_folio_clamp_set_writeback(fs_info, locked_folio, start, len);
> +	btrfs_folio_clamp_clear_writeback(fs_info, locked_folio, start, len);

Could this clear dirty; set writeback; clear writeback sequence benefit
from a good name and a helper function too?

> +
> +	for (pgoff_t index = start_index; index <= end_index; index++) {
> +		struct folio *folio;
> +
> +		/* Already handled at the beginning. */
> +		if (index == locked_folio->index)
> +			continue;
> +		folio = __filemap_get_folio(mapping, index, FGP_LOCK, GFP_NOFS);
> +		/* Cache already dropped, no need to do any cleanup. */
> +		if (IS_ERR(folio))
> +			continue;
> +		btrfs_folio_clamp_clear_dirty(fs_info, folio, start, len);
> +		btrfs_folio_clamp_set_writeback(fs_info, folio, start, len);
> +		btrfs_folio_clamp_clear_writeback(fs_info, folio, start, len);
> +		folio_unlock(folio);
> +		folio_put(folio);
> +	}
> +	mapping_set_error(mapping, error);
> +}
> +
>  /*
>   * when nowcow writeback call back.  This checks for snapshots or COW copies
>   * of the extents that exist in the file, and COWs the file as required.
> @@ -1976,6 +2018,11 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>  	struct btrfs_root *root = inode->root;
>  	struct btrfs_path *path;
>  	u64 cow_start = (u64)-1;
> +	/*
> +	 * If not 0, represents the inclusive end of the last fallback_to_cow()
> +	 * range. Only for error handling.
> +	 */
> +	u64 cow_end = 0;
>  	u64 cur_offset = start;
>  	int ret;
>  	bool check_prev = true;
> @@ -2136,6 +2183,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>  					      found_key.offset - 1);
>  			cow_start = (u64)-1;
>  			if (ret) {
> +				cow_end = found_key.offset - 1;
>  				btrfs_dec_nocow_writers(nocow_bg);
>  				goto error;
>  			}
> @@ -2209,11 +2257,12 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>  		cow_start = cur_offset;
>  
>  	if (cow_start != (u64)-1) {
> -		cur_offset = end;
>  		ret = fallback_to_cow(inode, locked_folio, cow_start, end);
>  		cow_start = (u64)-1;
> -		if (ret)
> +		if (ret) {
> +			cow_end = end;
>  			goto error;
> +		}
>  	}
>  
>  	btrfs_free_path(path);
> @@ -2221,12 +2270,42 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>  
>  error:
>  	/*
> -	 * If an error happened while a COW region is outstanding, cur_offset
> -	 * needs to be reset to cow_start to ensure the COW region is unlocked
> -	 * as well.
> +	 * There are several error cases:
> +	 *
> +	 * 1) Failed without falling back to COW
> +	 *    start         cur_start              end
> +	 *    |/////////////|                      |
> +	 *
> +	 *    For range [start, cur_start) the folios are already unlocked (except
> +	 *    @locked_folio), EXTENT_DELALLOC already removed.
> +	 *    Only need to clear the dirty flag as they will never be submitted.
> +	 *    Ordered extent and extent maps are handled by
> +	 *    btrfs_mark_ordered_io_finished() inside run_delalloc_range().
> +	 *
> +	 * 2) Failed with error from fallback_to_cow()
> +	 *    start         cur_start   cow_end    end
> +	 *    |/////////////|-----------|          |
> +	 *
> +	 *    For range [start, cur_start) it's the same as case 1).
> +	 *    But for range [cur_start, cow_end), the folios have dirty flag
> +	 *    cleared and unlocked, EXTENT_DEALLLOC cleared.
> +	 *    There may or may not be any ordered extents/extent maps allocated.
> +	 *
> +	 *    We should not call extent_clear_unlock_delalloc() on range [cur_start,
> +	 *    cow_end), as the folios are already unlocked.
> +	 *

I think it would be helpful to include cur_offset in your drawings.

> +	 * So clear the folio dirty flags for [start, cur_offset) first.
>  	 */
> -	if (cow_start != (u64)-1)
> -		cur_offset = cow_start;
> +	if (cur_offset > start)
> +		cleanup_dirty_folios(inode, locked_folio, start, cur_offset - 1, ret);
> +
> +	/*
> +	 * If an error happened while a COW region is outstanding, cur_offset
> +	 * needs to be reset to @cow_end + 1 to skip the COW range, as
> +	 * cow_file_range() will do the proper cleanup at error.
> +	 */
> +	if (cow_end)
> +		cur_offset = cow_end + 1;
>  
>  	/*
>  	 * We need to lock the extent here because we're clearing DELALLOC and
> -- 
> 2.47.1
>