In ext4_[da_]write_begin(), grab_cache_page_write_begin() is being called without GFP_NOFS set for the context. This is considered adequate as any eventual memory reclaim triggered by a page allocation is being done before the transaction handle for the write operation is started. However, with the following setup: * fo creating and writing files on XFS * XFS file system on top of dm-zoned target device * dm-zoned target created on tcmu-runner emulated ZBC disk * emulated ZBC disk backend file on ext4 * ext4 file system on regular SATA SSD A deadlock was observed under the heavy file write workload generated using fio. The deadlock is clearly apparent from the backtrace of the tcmu-runner handler task backtrace: tcmu-runner call Trace: wait_for_completion+0x12c/0x170 <-- deadlock (see below text) xfs_buf_iowait+0x39/0x1c0 [xfs] __xfs_buf_submit+0x118/0x2e0 [xfs] <-- XFS issues writes to dm-zoned device, which in turn will issue writes to the emulated ZBC device handled by tcmu-runner. xfs_bwrite+0x25/0x60 [xfs] xfs_reclaim_inode+0x303/0x330 [xfs] xfs_reclaim_inodes_ag+0x223/0x450 [xfs] xfs_reclaim_inodes_nr+0x31/0x40 [xfs] <-- Absence of GFP_NOFS allows reclaim to call in XFS reclaim for the XFS file system on the emulated device super_cache_scan+0x153/0x1a0 do_shrink_slab+0x17b/0x3c0 shrink_slab+0x170/0x2c0 shrink_node+0x1d6/0x4a0 do_try_to_free_pages+0xdb/0x3c0 try_to_free_pages+0x112/0x2e0 <-- Page reclaim triggers __alloc_pages_slowpath+0x422/0x1020 __alloc_pages_nodemask+0x37f/0x400 pagecache_get_page+0xb4/0x390 grab_cache_page_write_begin+0x1d/0x40 ext4_da_write_begin+0xd6/0x530 generic_perform_write+0xc2/0x1e0 __generic_file_write_iter+0xf9/0x1d0 ext4_file_write_iter+0xc6/0x3b0 new_sync_write+0x12d/0x1d0 vfs_write+0xdb/0x1d0 ksys_pwrite64+0x65/0xa0 <-- tcmu-runner ZBC handler writes to the backend file in response to dm-zoned target write IO When XFS reclaim issues write IOs to the dm-zoned target device, dm-zoned issue write IOs to the tcmu-runner emulated device. However, the tcmu ZBC handler is singled threaded and blocked waiting for the completion of the started pwrite() call and so does not process the newly issued write IOs necessary to complete page reclaim. The system is in a deadlocked state. This problem is 100% reproducible, the deadlock happening after fio running for a few minutes. A similar deadlock was also observed for read operations into the ext4 backend file on page cache miss trigering a page allocation and page reclaim. Switching the tcmu emulated ZBC disk backend file from an ext4 file to an XFS file, none of these deadlocks are observed. Fix this problem by removing __GFP_FS from ext4 inode mapping gfp_mask. The code used for this fix is borrowed from XFS xfs_setup_inode(). The inode mapping gfp_mask initialization is added to ext4_set_aops(). Reported-by: Masato Suzuki <masato.suzuki@xxxxxxx> Signed-off-by: Damien Le Moal <damien.lemoal@xxxxxxx> --- fs/ext4/inode.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 420fe3deed39..f882929037df 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1292,8 +1292,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping, * grab_cache_page_write_begin() can take a long time if the * system is thrashing due to memory pressure, or if the page * is being written back. So grab it first before we start - * the transaction handle. This also allows us to allocate - * the page (if needed) without using GFP_NOFS. + * the transaction handle. */ retry_grab: page = grab_cache_page_write_begin(mapping, index, flags); @@ -3084,8 +3083,7 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping, * grab_cache_page_write_begin() can take a long time if the * system is thrashing due to memory pressure, or if the page * is being written back. So grab it first before we start - * the transaction handle. This also allows us to allocate - * the page (if needed) without using GFP_NOFS. + * the transaction handle. */ retry_grab: page = grab_cache_page_write_begin(mapping, index, flags); @@ -4003,6 +4001,8 @@ static const struct address_space_operations ext4_dax_aops = { void ext4_set_aops(struct inode *inode) { + gfp_t gfp_mask; + switch (ext4_inode_journal_mode(inode)) { case EXT4_INODE_ORDERED_DATA_MODE: case EXT4_INODE_WRITEBACK_DATA_MODE: @@ -4019,6 +4019,14 @@ void ext4_set_aops(struct inode *inode) inode->i_mapping->a_ops = &ext4_da_aops; else inode->i_mapping->a_ops = &ext4_aops; + + /* + * Ensure all page cache allocations are done from GFP_NOFS context to + * prevent direct reclaim recursion back into the filesystem and blowing + * stacks or deadlocking. + */ + gfp_mask = mapping_gfp_mask(inode->i_mapping); + mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS))); } static int __ext4_block_zero_page_range(handle_t *handle, -- 2.21.0