On Fri, Jun 07, 2019 at 10:10:23PM +0900, Naohiro Aota wrote:
> In a write-heavy workload, the following scenario can occur:
>
> 1. Pages #0 to #2 (and their corresponding extent region) are marked dirty
>    and as candidates for delayed allocation
>
>    pages    0 1 2 3 4
>    dirty    o o o - -
>    towrite  - - - - -
>    delayed  o o o - -
>    alloc
>
> 2. extent_write_cache_pages() marks the dirty pages as TOWRITE
>
>    pages    0 1 2 3 4
>    dirty    o o o - -
>    towrite  o o o - -
>    delayed  o o o - -
>    alloc
>
> 3. Meanwhile, another write dirties page #3 and page #4
>
>    pages    0 1 2 3 4
>    dirty    o o o o o
>    towrite  o o o - -
>    delayed  o o o o o
>    alloc
>
> 4. find_lock_delalloc_range() decides to allocate a region to write page #0
>    to page #4
> 5. but extent_write_cache_pages() only initiates writeback for the TOWRITE
>    tagged pages (#0 to #2)
>
> So the above process leaves page #3 and page #4 behind. Usually, the
> periodic dirty flush kicks off write IO for page #3 and #4. However, if we
> try to mount a subvolume at this point, the mount process takes the s_umount
> write lock, which blocks the periodic flush from coming in.
>
> To deal with the problem, shrink the delayed allocation region so it covers
> only the pages expected to be written.
>
> Signed-off-by: Naohiro Aota <naohiro.aota@xxxxxxx>
> ---
>  fs/btrfs/extent_io.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index c73c69e2bef4..ea582ff85c73 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3310,6 +3310,33 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode,
>                         delalloc_start = delalloc_end + 1;
>                         continue;
>                 }
> +
> +               if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED) &&
> +                   (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) &&
> +                   ((delalloc_start >> PAGE_SHIFT) <
> +                    (delalloc_end >> PAGE_SHIFT))) {
> +                       unsigned long i;
> +                       unsigned long end_index = delalloc_end >> PAGE_SHIFT;
> +
> +                       for (i = delalloc_start >> PAGE_SHIFT;
> +                            i <= end_index; i++)
> +                               if (!xa_get_mark(&inode->i_mapping->i_pages, i,
> +                                                PAGECACHE_TAG_TOWRITE))
> +                                       break;
> +
> +                       if (i <= end_index) {
> +                               u64 unlock_start = (u64)i << PAGE_SHIFT;
> +
> +                               if (i == delalloc_start >> PAGE_SHIFT)
> +                                       unlock_start += PAGE_SIZE;
> +
> +                               unlock_extent(tree, unlock_start, delalloc_end);
> +                               __unlock_for_delalloc(inode, page, unlock_start,
> +                                                     delalloc_end);
> +                               delalloc_end = unlock_start - 1;
> +                       }
> +               }
> +

Helper please.  Really, for all this hmzoned stuff I want it segregated as
much as possible, so that when I'm debugging or cleaning other stuff up I can
easily say "oh, this is for zoned devices, it doesn't matter."  Thanks,

Josef
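
For reference, a rough sketch of the kind of helper being asked for, purely
illustrative and untested: the name btrfs_hmzoned_shrink_delalloc() and its
signature are invented here (not from the thread); the body is just the HMZONED
block from the patch above moved out of writepage_delalloc(), using only calls
that already appear in that hunk.

static void btrfs_hmzoned_shrink_delalloc(struct inode *inode,
                                          struct page *page,
                                          struct writeback_control *wbc,
                                          struct extent_io_tree *tree,
                                          u64 delalloc_start, u64 *delalloc_end)
{
        unsigned long end_index = *delalloc_end >> PAGE_SHIFT;
        unsigned long i;
        u64 unlock_start;

        /* Only relevant on HMZONED filesystems during tagged writeback. */
        if (!btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED))
                return;
        if (wbc->sync_mode != WB_SYNC_ALL && !wbc->tagged_writepages)
                return;
        if ((delalloc_start >> PAGE_SHIFT) >= end_index)
                return;

        /* Find the first page in the delalloc range not tagged TOWRITE. */
        for (i = delalloc_start >> PAGE_SHIFT; i <= end_index; i++)
                if (!xa_get_mark(&inode->i_mapping->i_pages, i,
                                 PAGECACHE_TAG_TOWRITE))
                        break;
        if (i > end_index)
                return;

        unlock_start = (u64)i << PAGE_SHIFT;
        if (i == delalloc_start >> PAGE_SHIFT)
                unlock_start += PAGE_SIZE;

        /* Drop the tail that extent_write_cache_pages() will not write now. */
        unlock_extent(tree, unlock_start, *delalloc_end);
        __unlock_for_delalloc(inode, page, unlock_start, *delalloc_end);
        *delalloc_end = unlock_start - 1;
}

The call site in writepage_delalloc() would then shrink to a single
btrfs_hmzoned_shrink_delalloc(inode, page, wbc, tree, delalloc_start,
&delalloc_end) call, keeping the zoned-device special case out of the common
path.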