[PATCH] ext4: Fix deadlock on page reclaim

Damien Le Moal <damien.lemoal@xxxxxxx> · Thu, 25 Jul 2019 18:33:58 +0900

In ext4_[da_]write_begin(), grab_cache_page_write_begin() is being
called without GFP_NOFS set for the context. This is considered adequate
as any eventual memory reclaim triggered by a page allocation is being
done before the transaction handle for the write operation is started.

However, with the following setup:

* fo creating and writing files on XFS
* XFS file system on top of dm-zoned target device
* dm-zoned target created on tcmu-runner emulated ZBC disk
* emulated ZBC disk backend file on ext4
* ext4 file system on regular SATA SSD

A deadlock was observed under the heavy file write workload generated
using fio. The deadlock is clearly apparent from the backtrace of the
tcmu-runner handler task backtrace:

tcmu-runner call Trace:
wait_for_completion+0x12c/0x170		<-- deadlock (see below text)
xfs_buf_iowait+0x39/0x1c0 [xfs]
__xfs_buf_submit+0x118/0x2e0 [xfs]	<-- XFS issues writes to
				            dm-zoned device, which in
					    turn will issue writes to
					    the emulated ZBC device
					    handled by tcmu-runner.
xfs_bwrite+0x25/0x60 [xfs]
xfs_reclaim_inode+0x303/0x330 [xfs]
xfs_reclaim_inodes_ag+0x223/0x450 [xfs]
xfs_reclaim_inodes_nr+0x31/0x40 [xfs]	<-- Absence of GFP_NOFS allows
					    reclaim to call in XFS
					    reclaim for the XFS file
					    system on the emulated
					    device
super_cache_scan+0x153/0x1a0
do_shrink_slab+0x17b/0x3c0
shrink_slab+0x170/0x2c0
shrink_node+0x1d6/0x4a0
do_try_to_free_pages+0xdb/0x3c0
try_to_free_pages+0x112/0x2e0		<-- Page reclaim triggers
__alloc_pages_slowpath+0x422/0x1020
__alloc_pages_nodemask+0x37f/0x400
pagecache_get_page+0xb4/0x390
grab_cache_page_write_begin+0x1d/0x40
ext4_da_write_begin+0xd6/0x530
generic_perform_write+0xc2/0x1e0
__generic_file_write_iter+0xf9/0x1d0
ext4_file_write_iter+0xc6/0x3b0
new_sync_write+0x12d/0x1d0
vfs_write+0xdb/0x1d0
ksys_pwrite64+0x65/0xa0		<-- tcmu-runner ZBC handler writes to
				    the backend file in response to
				    dm-zoned target write IO

When XFS reclaim issues write IOs to the dm-zoned target device,
dm-zoned issue write IOs to the tcmu-runner emulated device. However,
the tcmu ZBC handler is singled threaded and blocked waiting for the
completion of the started pwrite() call and so does not process the
newly issued write IOs necessary to complete page reclaim. The system
is in a deadlocked state. This problem is 100% reproducible, the
deadlock happening after fio running for a few minutes.

A similar deadlock was also observed for read operations into the ext4
backend file on page cache miss trigering a page allocation and page
reclaim. Switching the tcmu emulated ZBC disk backend file from an ext4
file to an XFS file, none of these deadlocks are observed.

Fix this problem by removing __GFP_FS from ext4 inode mapping gfp_mask.
The code used for this fix is borrowed from XFS xfs_setup_inode(). The
inode mapping gfp_mask initialization is added to ext4_set_aops().

Reported-by: Masato Suzuki <masato.suzuki@xxxxxxx>
Signed-off-by: Damien Le Moal <damien.lemoal@xxxxxxx>
---
 fs/ext4/inode.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 420fe3deed39..f882929037df 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1292,8 +1292,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
 	 * grab_cache_page_write_begin() can take a long time if the
 	 * system is thrashing due to memory pressure, or if the page
 	 * is being written back.  So grab it first before we start
-	 * the transaction handle.  This also allows us to allocate
-	 * the page (if needed) without using GFP_NOFS.
+	 * the transaction handle.
 	 */
 retry_grab:
 	page = grab_cache_page_write_begin(mapping, index, flags);
@@ -3084,8 +3083,7 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
 	 * grab_cache_page_write_begin() can take a long time if the
 	 * system is thrashing due to memory pressure, or if the page
 	 * is being written back.  So grab it first before we start
-	 * the transaction handle.  This also allows us to allocate
-	 * the page (if needed) without using GFP_NOFS.
+	 * the transaction handle.
 	 */
 retry_grab:
 	page = grab_cache_page_write_begin(mapping, index, flags);
@@ -4003,6 +4001,8 @@ static const struct address_space_operations ext4_dax_aops = {
 
 void ext4_set_aops(struct inode *inode)
 {
+	gfp_t gfp_mask;
+
 	switch (ext4_inode_journal_mode(inode)) {
 	case EXT4_INODE_ORDERED_DATA_MODE:
 	case EXT4_INODE_WRITEBACK_DATA_MODE:
@@ -4019,6 +4019,14 @@ void ext4_set_aops(struct inode *inode)
 		inode->i_mapping->a_ops = &ext4_da_aops;
 	else
 		inode->i_mapping->a_ops = &ext4_aops;
+
+	/*
+	 * Ensure all page cache allocations are done from GFP_NOFS context to
+	 * prevent direct reclaim recursion back into the filesystem and blowing
+	 * stacks or deadlocking.
+	 */
+	gfp_mask = mapping_gfp_mask(inode->i_mapping);
+	mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS)));
 }
 
 static int __ext4_block_zero_page_range(handle_t *handle,
-- 
2.21.0