Re: generic/475 deadlock?

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 21 Mar 2019 08:39:33 +1100

On Tue, Mar 19, 2019 at 10:04:08PM -0700, Darrick J. Wong wrote:
> Hmmm.
> 
> Every now and then I see a generic/475 deadlock that generates the
> hangcheck warning pasted below.
> 
> I /think/ this is ... the ail is processing an inode log item, for which
> it locked the cluster buffer and pushed the cil to unpin the buffer.
> However, the cil is cleaning up after the shutdown and is trying to
> simulate an EIO completion, but tries to grab the buffer lock and hence
> the cil and ail deadlock.  Maybe the solution is to trylock in the
> (freed && remove) case of xfs_buf_item_unpin, since we're tearing the
> whole system down anyway?

Oh, that looks like a bug in xfs_iflush() - we are forcing the log
to unpin a buffer we already own the lock on. It's the same problem
we had in the discard code fixed by commit 8c81dd46ef3c ("Force log
to disk before reading the AGF during a fstrim").

It also means that the log forces in the busy extent code have the
same potential problem, as does xfs_qm_dqflush().
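
To make the ordering problem concrete, here is a rough userspace sketch,
not kernel code. Every name in it (sketch_buf, sketch_log_force, the two
flush variants) is made up for illustration and none of it is the XFS
API. It assumes the log force completion path needs the buffer lock, and
it uses a trylock so that the bad ordering reports failure where the
real code would just hang:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a locked cluster buffer; not struct xfs_buf. */
struct sketch_buf {
	atomic_flag locked;
};

/* Try to take the buffer lock; returns true if we got it. */
static bool sketch_buf_trylock(struct sketch_buf *bp)
{
	return !atomic_flag_test_and_set(&bp->locked);
}

static void sketch_buf_unlock(struct sketch_buf *bp)
{
	atomic_flag_clear(&bp->locked);
}

/*
 * Hypothetical stand-in for a log force. Assumption: the completion
 * side must take the buffer lock to unpin the buffer. Returning false
 * here marks the spot where a blocking lock would wait forever.
 */
static bool sketch_log_force(struct sketch_buf *bp)
{
	if (!sketch_buf_trylock(bp))
		return false;	/* lock is held by the caller: deadlock */
	sketch_buf_unlock(bp);
	return true;
}

/* Problem ordering: force the log while we already own the buffer lock. */
static bool flush_force_while_locked(struct sketch_buf *bp)
{
	bool ok;

	if (!sketch_buf_trylock(bp))
		return false;
	ok = sketch_log_force(bp);	/* needs the lock we are holding */
	sketch_buf_unlock(bp);
	return ok;
}

/* Alternative ordering: force the log first, then lock the buffer. */
static bool flush_force_then_lock(struct sketch_buf *bp)
{
	if (!sketch_log_force(bp))
		return false;
	if (!sketch_buf_trylock(bp))
		return false;
	/* ... the flush work would go here ... */
	sketch_buf_unlock(bp);
	return true;
}
```

With a buffer initialised as { ATOMIC_FLAG_INIT }, the first variant
fails and the second succeeds. Whether reordering, a trylock, or
something else is the right fix for each of the callers above is a
separate question.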

I'll move further down the discussion now....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx