Re: generic/475 deadlock?

On Wed, Mar 20, 2019 at 01:59:22PM -0400, Brian Foster wrote:
> On Wed, Mar 20, 2019 at 10:45:51AM -0700, Darrick J. Wong wrote:
> > On Wed, Mar 20, 2019 at 01:03:05PM -0400, Brian Foster wrote:
> > > On Tue, Mar 19, 2019 at 10:04:08PM -0700, Darrick J. Wong wrote:
> > > And unmount is doing a log force as well..
> > > 
> > > Hmm.. yeah, I think we need to figure out how/why the buffer is locked.
> > 
> > AFAICT it's xfs_inode_item_push -> xfs_iflush -> xfs_imap_to_bp that
> > returns the locked cluster buffer (code extract from xfs_iflush):
....
> > ...but if the buffer is also pinned then we kick the log (while holding
> > the buffer lock) to unpin the buffer.  Since the fs is shutdown, the cil
> > will just be trying to remove everything, but it needs to lock the
> > buffer to simulate EIO.
> > 
> 
> Yeah, makes sense.

Yup, and this same problem means any buffer we hold locked when we
issue a blocking log force can deadlock in the same way. If it's a
non-blocking log force (i.e. a log flush we don't wait for), then we
can move on, eventually unlock the buffer, and the log can make
progress.

> > A naive fix to the deadlock is to see if the inode buf ops are attached
> > and downgrade to a trylock and go ahead even if we can't lock it, but ick.
> > It resolves the deadlock so the umount can proceed but we also trigger a
> > lot of unlocked buffer asserts.

Yuck!

> > I'm not sure what "too long" means in that code chunk.  It was added to
> > reduce stalls in pdflush, which these days I think translates to trying
> > to reduce the amount of time reclaim can stall while flushing inodes...?

"too long" means that if there is no log pressure, it might be the
next periodic log work that forces the log and unpins the buffer.
i.e. it could be several seconds before the buffer gets unpinned.
This is the same reason we have the log force in xfs_buf_lock() - if
we have to wait for the next log force to unlock/unpin a stale
buffer, we can stall for several seconds.

> I'm not sure either. I was wondering why we'd need that given that
> xfsaild() has its own log force logic in cases where the delwri submit
> didn't fully complete (i.e., for pinned buffers). Relying on that seems
> more appropriate since we'd force the log per delwri submission for
> multiple potential buffers as opposed to per inode cluster buffer.
> Perhaps we could just kill this log force?

If the writeback is coming from the AIL, then we could check if the
buffer is pinned and then return EDEADLOCK or something like that to
get the item push return value set to XFS_ITEM_PINNED, in which case
the AIL will then do the log force that will unpin the object. This
will happen after we have added the buffer to the AIL delwri list
and unlocked the buffer, so everything should work just fine in that
case. Similar for dquot flushing from the AIL.

If the writeback is coming from inode reclaim, we've already
guaranteed the inode is unpinned, so we'll never hit this case
through there.

As for the busy extent code, I'll need to look at that much more
closely to determine if there's a way around this....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


