On Thu, Mar 10, 2022 at 03:46:50PM -0800, Darrick J. Wong wrote:
> On Wed, Mar 09, 2022 at 12:55:09PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > 
> > After 963 iterations of generic/530, it deadlocked during recovery
> > on a pinned inode cluster buffer like so:

....

> > What has happened here is that the AIL push thread has raced with
> > the inodegc process modifying, committing and pinning the inode
> > cluster buffer, here in xfs_buf_delwri_submit_buffers():
> > 
> > 	blk_start_plug(&plug);
> > 	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
> > 		if (!wait_list) {
> > 			if (xfs_buf_ispinned(bp)) {
> > 				pinned++;
> > 				continue;
> > 			}
> > Here >>>>>>
> > 			if (!xfs_buf_trylock(bp))
> > 				continue;
> > 
> > Basically, the AIL has found the buffer wasn't pinned and got the
> > lock without blocking, but then the buffer was pinned. This implies
> > the processing here was pre-empted between the pin check and the
> > lock, because the pin count can only be increased while holding the
> > buffer locked. Hence when it has gone to submit the IO, it has
> > blocked waiting for the buffer to be unpinned.
> > 
> > With all executing threads now waiting on the buffer to be unpinned,
> > we normally get out of situations like this via the background log
> > worker issuing a log force which will unpin stuck buffers like
> > this. But at this point in recovery, we haven't started the log
> > worker. In fact, the first thing we do after processing intents and
> > unlinked inodes is *start the log worker*. IOWs, we start it too
> > late to have it break deadlocks like this.
> 
> Because finishing the intents, processing unlinked inodes, and freeing
> dead COW extents are all just regular transactional updates that run
> after sorting out the log contents, there's no reason why the log worker
> oughtn't be running, right?

Yes.

> > Avoid this and any other similar deadlock vectors in intent and
> > unlinked inode recovery by starting the log worker before we recover
> > intents and unlinked inodes. This part of recovery runs as though
> > the filesystem is fully active, so we really should have the same
> > infrastructure running as we normally do at runtime.
> > 
> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > ---
> >  fs/xfs/xfs_log.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index 89fec9a18c34..ffd928cf9a9a 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -812,10 +812,9 @@ xfs_log_mount_finish(
> >  	 * mount failure occurs.
> >  	 */
> >  	mp->m_super->s_flags |= SB_ACTIVE;
> > +	xfs_log_work_queue(mp);
> >  	if (xlog_recovery_needed(log))
> >  		error = xlog_recover_finish(log);
> > -	if (!error)
> > -		xfs_log_work_queue(mp);
> 
> I /think/ in the error case, we'll cancel and wait for the worker in
> xfs_mountfs -> xfs_log_mount_cancel -> xfs_log_unmount -> xfs_log_clean
> -> xfs_log_quiesce, right?

Yeah, it took me a while to convince myself we did actually tear it
down correctly on later failures in xfs_mountfs(), because this is a
bit of a twisty path.

> TBH I'd tried to solve these g/530 hangs by making this exact change, so
> assuming the answers are {yes, yes}, then:
> 
> Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>

Thanks!

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
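
For anyone following the race described in the quoted commit message, here is a
minimal standalone userspace sketch of the same check-then-trylock window. It is
not the XFS code: fake_buf, pin_side() and submit_side() are hypothetical
stand-ins for the buffer, the committing/pinning path and the delwri submit
path, and whether the window is actually hit depends entirely on scheduling.

/*
 * Userspace sketch of the unlocked-pin-check / trylock window.  NOT the
 * XFS code; all names are made-up stand-ins.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

struct fake_buf {
	pthread_mutex_t	lock;		/* models the buffer lock */
	atomic_int	pin_count;	/* only ever raised with the lock held */
};

static struct fake_buf buf = {
	.lock = PTHREAD_MUTEX_INITIALIZER,
	.pin_count = 0,
};

/* Models the transaction commit side: pin the buffer under its lock. */
static void *pin_side(void *arg)
{
	pthread_mutex_lock(&buf.lock);
	atomic_fetch_add(&buf.pin_count, 1);	/* buffer is now pinned */
	pthread_mutex_unlock(&buf.lock);
	return NULL;
}

/* Models the delwri submit side: unlocked pin check, then trylock. */
static void *submit_side(void *arg)
{
	if (atomic_load(&buf.pin_count) != 0) {
		printf("skipped: buffer already pinned\n");
		return NULL;
	}

	/*
	 * pin_side() can run right here, between the unlocked pin check
	 * above and the trylock below, because the checker does not yet
	 * hold the lock that the pinner takes.
	 */
	if (pthread_mutex_trylock(&buf.lock) != 0)
		return NULL;

	if (atomic_load(&buf.pin_count) != 0)
		printf("locked a pinned buffer: submit would block on unpin\n");
	else
		printf("locked an unpinned buffer: safe to submit\n");

	pthread_mutex_unlock(&buf.lock);
	return NULL;
}

int main(void)
{
	pthread_t pinner, submitter;

	pthread_create(&submitter, NULL, submit_side, NULL);
	pthread_create(&pinner, NULL, pin_side, NULL);
	pthread_join(pinner, NULL);
	pthread_join(submitter, NULL);
	return 0;
}

Build with something like "cc -pthread sketch.c". The point is only that the
pin count can rise after the unlocked check but before the lock is taken, which
is why the real submit path can end up blocked waiting on an unpin that only a
log force (normally driven by the log worker) will provide.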