In a write-heavy workload, the following scenario can occur:

1. Pages #0 to #2 (and their corresponding extent region) are marked
   dirty and become candidates for delayed allocation

   pages     0 1 2 3 4
   dirty     o o o - -
   towrite   - - - - -
   delalloc  o o o - -

2. extent_write_cache_pages() marks the dirty pages as TOWRITE

   pages     0 1 2 3 4
   dirty     o o o - -
   towrite   o o o - -
   delalloc  o o o - -

3. Meanwhile, another write dirties pages #3 and #4

   pages     0 1 2 3 4
   dirty     o o o o o
   towrite   o o o - -
   delalloc  o o o o o

4. find_lock_delalloc_range() decides to allocate a region to write
   pages #0 to #4

5. However, extent_write_cache_pages() only initiates writes to
   TOWRITE-tagged pages (#0 to #2)

So the above process leaves pages #3 and #4 behind. Usually, the
periodic dirty flush kicks off write I/Os for pages #3 and #4. However,
if a subvolume is mounted at that moment, the mount process takes the
s_umount write lock, which blocks the periodic flush from coming in.

To deal with the problem, shrink the delayed allocation region so that
it covers only the pages expected to be written.

Signed-off-by: Naohiro Aota <naohiro.aota@xxxxxxx>
---
 fs/btrfs/extent_io.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c73c69e2bef4..ea582ff85c73 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3310,6 +3310,33 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode,
 			delalloc_start = delalloc_end + 1;
 			continue;
 		}
+
+		if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED) &&
+		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) &&
+		    ((delalloc_start >> PAGE_SHIFT) <
+		     (delalloc_end >> PAGE_SHIFT))) {
+			unsigned long i;
+			unsigned long end_index = delalloc_end >> PAGE_SHIFT;
+
+			for (i = delalloc_start >> PAGE_SHIFT;
+			     i <= end_index; i++)
+				if (!xa_get_mark(&inode->i_mapping->i_pages, i,
+						 PAGECACHE_TAG_TOWRITE))
+					break;
+
+			if (i <= end_index) {
+				u64 unlock_start = (u64)i << PAGE_SHIFT;
+
+				if (i == delalloc_start >> PAGE_SHIFT)
+					unlock_start += PAGE_SIZE;
+
+				unlock_extent(tree, unlock_start, delalloc_end);
+				__unlock_for_delalloc(inode, page, unlock_start,
+						      delalloc_end);
+				delalloc_end = unlock_start - 1;
+			}
+		}
+
 		ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
 				delalloc_end, &page_started, nr_written, wbc);
 		/* File system has been set read-only */
--
2.21.0
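
For illustration, the range-shrinking logic in the hunk above can be
modeled in user space. The following sketch is hypothetical and not part
of the patch: it stands in for the xarray tag lookup (xa_get_mark() on
i_pages) with a plain bool array, and reproduces the scenario from the
commit message, where pages #0 to #2 are tagged TOWRITE but the delalloc
range spans pages #0 to #4.

/*
 * User-space model of the range shrinking done in writepage_delalloc().
 * The PAGECACHE_TAG_TOWRITE lookup is simulated with a bool array; the
 * page-index arithmetic mirrors the kernel code.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Simulated TOWRITE tags for pages #0..#4 (state after step 3 above). */
static const bool towrite[] = { true, true, true, false, false };

int main(void)
{
	/* Delalloc range covering pages #0 to #4, as in step 4. */
	uint64_t delalloc_start = 0;
	uint64_t delalloc_end = 5 * PAGE_SIZE - 1;
	unsigned long i;
	unsigned long end_index = delalloc_end >> PAGE_SHIFT;

	/* Find the first page in the range that is not tagged TOWRITE. */
	for (i = delalloc_start >> PAGE_SHIFT; i <= end_index; i++)
		if (!towrite[i])
			break;

	if (i <= end_index) {
		uint64_t unlock_start = (uint64_t)i << PAGE_SHIFT;

		/*
		 * Never shrink the range to zero pages: if even the first
		 * page is untagged, keep it so forward progress is made.
		 */
		if (i == delalloc_start >> PAGE_SHIFT)
			unlock_start += PAGE_SIZE;

		/*
		 * The kernel unlocks [unlock_start, delalloc_end] here
		 * (unlock_extent() and __unlock_for_delalloc()) before
		 * trimming the range.
		 */
		delalloc_end = unlock_start - 1;
	}

	printf("allocate delalloc range [%llu, %llu] (pages #%lu to #%lu)\n",
	       (unsigned long long)delalloc_start,
	       (unsigned long long)delalloc_end,
	       (unsigned long)(delalloc_start >> PAGE_SHIFT),
	       (unsigned long)(delalloc_end >> PAGE_SHIFT));
	return 0;
}

Running this prints a shrunk range covering only pages #0 to #2, which
matches the intent of the patch: allocation happens just for the
TOWRITE-tagged prefix, and pages #3 and #4 are left for a later
writeback cycle instead of being stranded inside an allocated extent.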