Re: [PATCH 1/7] xfs: log worker needs to start before intent/unlink recovery

On 15 Mar 2022 at 12:12, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> After 963 iterations of generic/530, it deadlocked during recovery
> on a pinned inode cluster buffer like so:
>
> XFS (pmem1): Starting recovery (logdev: internal)
> INFO: task kworker/8:0:306037 blocked for more than 122 seconds.
>       Not tainted 5.17.0-rc6-dgc+ #975
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/8:0     state:D stack:13024 pid:306037 ppid:     2 flags:0x00004000
> Workqueue: xfs-inodegc/pmem1 xfs_inodegc_worker
> Call Trace:
>  <TASK>
>  __schedule+0x30d/0x9e0
>  schedule+0x55/0xd0
>  schedule_timeout+0x114/0x160
>  __down+0x99/0xf0
>  down+0x5e/0x70
>  xfs_buf_lock+0x36/0xf0
>  xfs_buf_find+0x418/0x850
>  xfs_buf_get_map+0x47/0x380
>  xfs_buf_read_map+0x54/0x240
>  xfs_trans_read_buf_map+0x1bd/0x490
>  xfs_imap_to_bp+0x4f/0x70
>  xfs_iunlink_map_ino+0x66/0xd0
>  xfs_iunlink_map_prev.constprop.0+0x148/0x2f0
>  xfs_iunlink_remove_inode+0xf2/0x1d0
>  xfs_inactive_ifree+0x1a3/0x900
>  xfs_inode_unlink+0xcc/0x210
>  xfs_inodegc_worker+0x1ac/0x2f0
>  process_one_work+0x1ac/0x390
>  worker_thread+0x56/0x3c0
>  kthread+0xf6/0x120
>  ret_from_fork+0x1f/0x30
>  </TASK>
> task:mount           state:D stack:13248 pid:324509 ppid:324233 flags:0x00004000
> Call Trace:
>  <TASK>
>  __schedule+0x30d/0x9e0
>  schedule+0x55/0xd0
>  schedule_timeout+0x114/0x160
>  __down+0x99/0xf0
>  down+0x5e/0x70
>  xfs_buf_lock+0x36/0xf0
>  xfs_buf_find+0x418/0x850
>  xfs_buf_get_map+0x47/0x380
>  xfs_buf_read_map+0x54/0x240
>  xfs_trans_read_buf_map+0x1bd/0x490
>  xfs_imap_to_bp+0x4f/0x70
>  xfs_iget+0x300/0xb40
>  xlog_recover_process_one_iunlink+0x4c/0x170
>  xlog_recover_process_iunlinks.isra.0+0xee/0x130
>  xlog_recover_finish+0x57/0x110
>  xfs_log_mount_finish+0xfc/0x1e0
>  xfs_mountfs+0x540/0x910
>  xfs_fs_fill_super+0x495/0x850
>  get_tree_bdev+0x171/0x270
>  xfs_fs_get_tree+0x15/0x20
>  vfs_get_tree+0x24/0xc0
>  path_mount+0x304/0xba0
>  __x64_sys_mount+0x108/0x140
>  do_syscall_64+0x35/0x80
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
>  </TASK>
> task:xfsaild/pmem1   state:D stack:14544 pid:324525 ppid:     2 flags:0x00004000
> Call Trace:
>  <TASK>
>  __schedule+0x30d/0x9e0
>  schedule+0x55/0xd0
>  io_schedule+0x4b/0x80
>  xfs_buf_wait_unpin+0x9e/0xf0
>  __xfs_buf_submit+0x14a/0x230
>  xfs_buf_delwri_submit_buffers+0x107/0x280
>  xfs_buf_delwri_submit_nowait+0x10/0x20
>  xfsaild+0x27e/0x9d0
>  kthread+0xf6/0x120
>  ret_from_fork+0x1f/0x30
>
> We have the mount process waiting on an inode cluster buffer read,
> inodegc doing unlink waiting on the same inode cluster buffer, and
> the AIL push thread blocked in writeback waiting for the inode to
> become unpinned.

... "waiting for the inode cluster buffer to become unpinned"?.

Periodic flushing of the current CIL context by the log worker task will
clearly break the deadlock described in this patch. I don't see any negative
side effects of the change presented here. Hence,

Reviewed-by: Chandan Babu R <chandan.babu@xxxxxxxxxx>

>
> What has happened here is that the AIL push thread has raced with
> the inodegc process modifying, committing and pinning the inode
> cluster buffer in xfs_buf_delwri_submit_buffers() here:
>
> 	blk_start_plug(&plug);
> 	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
> 		if (!wait_list) {
> 			if (xfs_buf_ispinned(bp)) {
> 				pinned++;
> 				continue;
> 			}
> Here >>>>>>
> 			if (!xfs_buf_trylock(bp))
> 				continue;
>
> Basically, the AIL has found the buffer wasn't pinned and got the
> lock without blocking, but then the buffer was pinned. This implies
> the processing here was pre-empted between the pin check and the
> lock, because the pin count can only be increased while holding the
> buffer locked. Hence when it has gone to submit the IO, it has
> blocked waiting for the buffer to be unpinned.
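
To make sure I follow the window being described: the interleaving would
be roughly the one sketched below (illustrative only, using the function
names from the hunk above and the stack traces):

	AIL push (delwri submit)	inodegc / CIL commit
	------------------------	--------------------
	xfs_buf_ispinned(bp) -> false
	<pre-empted>
					xfs_buf_lock(bp)
					<modify, log and commit the
					 buffer; pin count goes up>
					xfs_buf_unlock(bp)
	xfs_buf_trylock(bp) -> succeeds
	__xfs_buf_submit(bp)
	  xfs_buf_wait_unpin(bp)	<- sleeps until a log force
					   unpins the buffer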
>
> With all executing threads now waiting on the buffer to be unpinned,
> we normally get out of situations like this via the background log
> worker issuing a log force which will unpin stuck buffers like
> this. But at this point in recovery, we haven't started the log
> worker. In fact, the first thing we do after processing intents and
> unlinked inodes is *start the log worker*. IOWs, we start it too
> late to have it break deadlocks like this.
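
As an aside for anyone else reading along, the periodic work in question
looks roughly like the sketch below. This is a paraphrase of
xfs_log_worker() in fs/xfs/xfs_log.c with the log covering logic left
out, so treat the details as approximate rather than exact:

	/*
	 * Rough paraphrase of the background log worker (not the exact
	 * upstream code; the log covering path is omitted). The important
	 * part for this deadlock is the periodic log force, which is what
	 * unpins buffers stuck like the ones in the traces above.
	 */
	void
	xfs_log_worker(
		struct work_struct	*work)
	{
		struct xlog	*log = container_of(to_delayed_work(work),
						struct xlog, l_work);
		struct xfs_mount	*mp = log->l_mp;

		/* force the CIL and iclogs to disk, unpinning committed items */
		xfs_log_force(mp, 0);

		/* kick the AIL so writeback of now-unpinned metadata starts */
		xfs_ail_push_all(mp->m_ail);

		/* requeue ourselves to run again after the next interval */
		xfs_log_work_queue(mp);
	}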
>
> Avoid this and any other similar deadlock vectors in intent and
> unlinked inode recovery by starting the log worker before we recover
> intents and unlinked inodes. This part of recovery runs as though
> the filesystem is fully active, so we really should have the same
> infrastructure running as we normally do at runtime.
>
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> ---
>  fs/xfs/xfs_log.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 9a49acd94516..b0e05fa902d4 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -833,10 +833,9 @@ xfs_log_mount_finish(
>  	 * mount failure occurs.
>  	 */
>  	mp->m_super->s_flags |= SB_ACTIVE;
> +	xfs_log_work_queue(mp);
>  	if (xlog_recovery_needed(log))
>  		error = xlog_recover_finish(log);
> -	if (!error)
> -		xfs_log_work_queue(mp);
>  	mp->m_super->s_flags &= ~SB_ACTIVE;
>  	evict_inodes(mp->m_super);


-- 
chandan


