Oops, this is the version 2 of the patch. Best, Hui Hui Zhu <teawaterz@xxxxxxxxxxxxxxxxx> 于2023年8月11日周五 15:15写道: > > From: Hui Zhu <teawater@xxxxxxxxxxxx> > > This version fix the gfp flags in the callers instead of working this > new "bool" flag through the buffer head layers according to the comments > from Matthew Wilcox. > > Encountered an issue where a large number of filesystem reads and writes > occurred suddenly within a container. At the same time, other tasks on > the same host that were performing filesystem read and write operations > became blocked. It was observed that many of the blocked tasks were > blocked on the ext4 journal lock. For example: > PID: 171453 TASK: ffff926566c9440 CPU: 54 COMMAND: "Thread" > 0 [] __schedule at xxxxxxxxxxxxxxx > 1 [] schedule at xxxxxxxxxxxxxxx > 2 [] wait_transaction_locked at xxxxxxxxxxxxxxx > 3 [] add_transaction_credits at xxxxxxxxxxxxxxx > 4 [] start_this_handle at xxxxxxxxxxxxxxx > 5 [] jbd2__journal_start at xxxxxxxxxxxxxxx > 6 [] ext4_journal_start_sb at xxxxxxxxxxxxxxx > 7 [] ext4_dirty_inode at xxxxxxxxxxxxxxx > 8 [] mark_inode_dirty at xxxxxxxxxxxxxxx > 9 [] generic_update_time at xxxxxxxxxxxxxxx > > Meanwhile, it was observed that the task holding the ext4 journal lock > was blocked for an extended period of time on "shrink_page_list" due to > "ext4_sb_breadahead_unmovable". > 0 [] __schedule at xxxxxxxxxxxxxxx > 1 [] _cond_resched at xxxxxxxxxxxxxxx > 2 [] shrink_page_list at xxxxxxxxxxxxxxx > 3 [] shrink_inactive_list at xxxxxxxxxxxxxxx > 4 [] shrink_lruvec at xxxxxxxxxxxxxxx > 5 [] shrink_node_memcgs at xxxxxxxxxxxxxxx > 6 [] shrink_node at xxxxxxxxxxxxxxx > 7 [] shrink_zones at xxxxxxxxxxxxxxx > 8 [] do_try_to_free_pages at xxxxxxxxxxxxxxx > 9 [] try_to_free_mem_cgroup_pages at xxxxxxxxxxxxxxx > 10 [] try_charge at xxxxxxxxxxxxxxx > 11 [] mem_cgroup_charge at xxxxxxxxxxxxxxx > 12 [] __add_to_page_cache_locked at xxxxxxxxxxxxxxx > 13 [] add_to_page_cache_lru at xxxxxxxxxxxxxxx > 14 [] pagecache_get_page at xxxxxxxxxxxxxxx > 15 [] grow_dev_page at xxxxxxxxxxxxxxx > 16 [] __getblk_slow at xxxxxxxxxxxxxxx > 17 [] ext4_sb_breadahead_unmovable at xxxxxxxxxxxxxxx > 18 [] __ext4_get_inode_loc at xxxxxxxxxxxxxxx > 19 [] ext4_get_inode_loc at xxxxxxxxxxxxxxx > 20 [] ext4_reserve_inode_write at xxxxxxxxxxxxxxx > 21 [] __ext4_mark_inode_dirty at xxxxxxxxxxxxxxx > 22 [] add_dirent_to_buf at xxxxxxxxxxxxxxx > 23 [] ext4_add_entry at xxxxxxxxxxxxxxx > 24 [] ext4_add_nondir at xxxxxxxxxxxxxxx > 25 [] ext4_create at xxxxxxxxxxxxxxx > 26 [] vfs_create at xxxxxxxxxxxxxxx > > The function "grow_dev_page" increased the gfp mask with "__GFP_NOFAIL", > causing longer blocking times. > /* > * XXX: __getblk_slow() can not really deal with failure and > * will endlessly loop on improvised global reclaim. Prefer > * looping in the allocator rather than here, at least that > * code knows what it's doing. > */ > gfp_mask |= __GFP_NOFAIL; > However, "ext4_sb_breadahead_unmovable" is a prefetch function and > failures are acceptable. > > Therefore, this commit changes "ext4_sb_breadahead_unmovable" to be > non-blocking. > Change gfp to ~__GFP_DIRECT_RECLAIM when ext4_sb_breadahead_unmovable > calls sb_getblk_gfp. > Modify grow_dev_page to will not be blocked by the allocation of folio > if gfp is ~__GFP_DIRECT_RECLAIM. > > Signed-off-by: Hui Zhu <teawater@xxxxxxxxxxxx> > --- > fs/buffer.c | 27 +++++++++++++++++++-------- > fs/ext4/super.c | 3 ++- > 2 files changed, 21 insertions(+), 9 deletions(-) > > diff --git a/fs/buffer.c b/fs/buffer.c > index bd091329026c..330cf19c77b1 100644 > --- a/fs/buffer.c > +++ b/fs/buffer.c > @@ -1038,6 +1038,7 @@ static sector_t folio_init_buffers(struct folio *folio, > * Create the page-cache page that contains the requested block. > * > * This is used purely for blockdev mappings. > + * Will not blocking by allocate folio if gfp is ~__GFP_DIRECT_RECLAIM. > */ > static int > grow_dev_page(struct block_device *bdev, sector_t block, > @@ -1050,18 +1051,27 @@ grow_dev_page(struct block_device *bdev, sector_t block, > int ret = 0; > gfp_t gfp_mask; > > - gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp; > + gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS); > + if (gfp == ~__GFP_DIRECT_RECLAIM) > + gfp_mask &= ~__GFP_DIRECT_RECLAIM; > + else { > + gfp_mask |= gfp; > > - /* > - * XXX: __getblk_slow() can not really deal with failure and > - * will endlessly loop on improvised global reclaim. Prefer > - * looping in the allocator rather than here, at least that > - * code knows what it's doing. > - */ > - gfp_mask |= __GFP_NOFAIL; > + /* > + * XXX: __getblk_slow() can not really deal with failure and > + * will endlessly loop on improvised global reclaim. Prefer > + * looping in the allocator rather than here, at least that > + * code knows what it's doing. > + */ > + gfp_mask |= __GFP_NOFAIL; > + } > > folio = __filemap_get_folio(inode->i_mapping, index, > FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp_mask); > + if (IS_ERR(folio)) { > + ret = PTR_ERR(folio); > + goto out; > + } > > bh = folio_buffers(folio); > if (bh) { > @@ -1091,6 +1101,7 @@ grow_dev_page(struct block_device *bdev, sector_t block, > failed: > folio_unlock(folio); > folio_put(folio); > +out: > return ret; > } > > diff --git a/fs/ext4/super.c b/fs/ext4/super.c > index c94ebf704616..6a529509b83b 100644 > --- a/fs/ext4/super.c > +++ b/fs/ext4/super.c > @@ -254,7 +254,8 @@ struct buffer_head *ext4_sb_bread_unmovable(struct super_block *sb, > > void ext4_sb_breadahead_unmovable(struct super_block *sb, sector_t block) > { > - struct buffer_head *bh = sb_getblk_gfp(sb, block, 0); > + struct buffer_head *bh = sb_getblk_gfp(sb, block, > + ~__GFP_DIRECT_RECLAIM); > > if (likely(bh)) { > if (trylock_buffer(bh)) > -- > 2.19.1.6.gb485710b >