Re: generic/475 deadlock?

On Fri, Mar 22, 2019 at 08:10:51AM +1100, Dave Chinner wrote:
> On Thu, Mar 21, 2019 at 08:11:52AM -0400, Brian Foster wrote:
> > On Thu, Mar 21, 2019 at 08:49:38AM +1100, Dave Chinner wrote:
> > > On Wed, Mar 20, 2019 at 01:59:22PM -0400, Brian Foster wrote:
> > > > On Wed, Mar 20, 2019 at 10:45:51AM -0700, Darrick J. Wong wrote:
> > > > > On Wed, Mar 20, 2019 at 01:03:05PM -0400, Brian Foster wrote:
> > > > > > On Tue, Mar 19, 2019 at 10:04:08PM -0700, Darrick J. Wong wrote:
> > > > > > And unmount is doing a log force as well..
> > > > > > 
> > > > > > Hmm.. yeah, I think we need to figure out how/why the buffer is locked.
> > > > > 
> > > > > AFAICT it's xfs_inode_item_push -> xfs_iflush -> xfs_imap_to_bp that
> > > > > returns the locked cluster buffer (code extract from xfs_iflush):
> > > ....
> > > > > ...but if the buffer is also pinned then we kick the log (while holding
> > > > > the buffer lock) to unpin the buffer.  Since the fs is shutdown, the cil
> > > > > will just be trying to remove everything, but it needs to lock the
> > > > > buffer to simulate EIO.
> > > > > 
> > > > 
> > > > Yeah, makes sense.
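
(To spell the deadlock out as I understand it -- hand-drawn, so the
exact frames are approximate:

	xfsaild:
	  xfs_inode_item_push
	    xfs_iflush
	      xfs_imap_to_bp		<- returns the cluster buffer, locked
	    xfs_log_force		<- buffer is pinned, so we kick the
					   log while still holding the
					   buffer lock

	CIL (fs already shut down):
	  abort processing		<- needs the buffer lock to simulate
					   EIO, blocked on xfsaild

...so neither side can make progress.)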
> > > 
> > > Yup, and this same problem means any buffer we hold locked when we
> > > issue a blocking log force can deadlock in the same way. If it's a
> > > non-blocking log force (i.e. a log flush we don't wait for) then we
> > > can move onwards and eventually we unlock the buffer and the log can
> > > make progress.
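
(For the archive, "blocking" vs "non-blocking" here is the difference
between, in current terms:

	xfs_log_force(mp, XFS_LOG_SYNC);	/* wait for the flush to complete */

and:

	xfs_log_force(mp, 0);			/* kick it off, don't wait */

...assuming I have the current calling convention right.)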
> > > 
> > > > > A naive fix to the deadlock is to see if the inode buf ops are attached
> > > > > and downgrade to a trylock and go ahead even if we can't lock it, but ick.
> > > > > It resolves the deadlock so the umount can proceed but we also trigger a
> > > > > lot of unlocked buffer asserts.
> > > 
> > > Yuck!
> > > 
> > > > > I'm not sure what "too long" means in that code chunk.  It was added to
> > > > > reduce stalls in pdflush, which these days I think translates to trying
> > > > > to reduce the amount of time reclaim can stall while flushing inodes...?
> > > 
> > > "too long" means that if there is no log pressure, it might be the
> > > next periodic log work that forces the log and unpins the buffer.
> > > i.e. it could be several seconds before the buffer gets unpinned.
> > > This is the same reason we have the log force in xfs_buf_lock() - if
> > > we have to wait for the next log force to unlock/unpin a stale
> > > buffer, we can stall for several seconds.
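
(That xfs_buf_lock() hunk, for anyone following along -- quoting from
memory, so modulo details:

	void
	xfs_buf_lock(
		struct xfs_buf	*bp)
	{
		trace_xfs_buf_lock(bp, _RET_IP_);

		if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE))
			xfs_log_force(bp->b_target->bt_mount, 0);
		down(&bp->b_sema);

		trace_xfs_buf_lock_done(bp, _RET_IP_);
	}

i.e. if we're about to block on a pinned+stale buffer, kick the log
first so the unpin doesn't have to wait for the next periodic force.)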
> > > 
> > 
> > Ok.. so it kind of looks to me that this particular instance of the log
> > force is a historical artifact of xfsaild. xfsaild used to use separate
> > ->trylock() and ->push() log item callbacks, where ->trylock() would
> > either lock the object or return LOCKED, PINNED, etc. So a pinned
> > buffer or inode would return PINNED here and xfsaild does the log force
> > on its behalf once it's through the current cycle.
> 
> Sure, that's no different to now. The only thing that is different
> is that we combined ->trylock, ->push and ->pushbuf into one
> function. The actual semantics and behaviour are unchanged.
> 
> i.e. if you look at xfs_iflush(ip, ASYNC) back then, it did:
> 
>         /*
>          * If the buffer is pinned then push on the log now so we won't
>          * get stuck waiting in the write for too long.
>          */
>         if (XFS_BUF_ISPINNED(bp))
>                 xfs_log_force(mp, (xfs_lsn_t)0, XFS_LOG_FORCE);
> 
> <snip iflush_cluster to pull in all other dirty inodes>
> 
>         if (flags & INT_DELWRI) {
>                 xfs_bdwrite(mp, bp);
>         } else if (flags & INT_ASYNC) {
>                 error = xfs_bawrite(mp, bp);
>         } else {
>                 error = xfs_bwrite(mp, bp);
>         }
> 
> And then it would write the inode buffer directly with an async write
> via xfs_bawrite(). i.e. there was no delwri list or anything like
> that - the inode flush controlled the inode buffer writeback and
> hence the unpinning of the inode buffer. The only difference now is
> that the caller controls how the buffer is written, not
> xfs_iflush...
> 

What I'm saying is that the difference in the interface explains the
historical need for the log force in xfs_iflush(). The difference that
I'm pointing out is that ->push() had no means to return item state. The
sequence for an unpinned inode with a pinned cluster buffer in 2.6.32 is
for ->trylock() to return success after acquiring the ilock and flush
lock, followed by a call to ->push(). We didn't call xfs_iflush() (and
thus get the inode buffer) until the push call:

                switch (lock_result) {
                case XFS_ITEM_SUCCESS:
                        XFS_STATS_INC(xs_push_ail_success);
                        IOP_PUSH(lip);
                        last_pushed_lsn = lsn;
                        break;

... which doesn't even have a return value. Hence there was no way to
return a pinned state for an inode cluster buffer, and xfs_iflush() had
to do the log force for a pinned cluster buffer itself.

That is no longer the case with the current code. I don't know with
certainty of course, but from that I surmised the xfs_iflush() ->
xfs_log_force() call existed to deal with this historical interface
limitation.
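
To put prototypes on it (from memory, so the exact spellings may be
off), the 2.6.3x log item ops were split as:

	uint	(*iop_trylock)(xfs_log_item_t *);
	void	(*iop_push)(xfs_log_item_t *);

...whereas the current combined callout is:

	uint	(*iop_push)(struct xfs_log_item *, struct list_head *);

...which can return XFS_ITEM_PINNED based on the state of the backing
buffer -- something the old ->push() had no way to communicate.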

> > An unpinned inode with a pinned cluster buffer didn't communicate this
> > state because the inode ->trylock() callout grabbed the ilock and iflock
> > and returned success. The iflush doesn't occur until ->push(), which
> > means if the cluster buffer is pinned and we wanted to force the log
> > then xfs_iflush() is the last chance to do it.
> > 
> > (For reference, I'm looking at the code as of v2.6.33).
> 
> Again, that's no different to what we have now - the push is for the
> inode, not the backing buffer. If we can't lock the inode because
> it's pinned or already flushing, then we return that info to the AIL
> loop. The actual state of the backing buffer is still completely
> hidden from the caller and so that's why I left the unpin in that
> code.
> 

Pretty much what I'm saying above.

> > The current xfsaild() log force logic is still tied to PINNED objects,
> > but it has additional logic to detect skipped buffers at delwri queue
> > submit time (i.e., due to being pinned) that should cover the pinned
> > inode cluster buffer case.
> 
> And it's not skipped buffers, either. Skipped buffers are only for
> determining timeouts - it's the ail->ail_log_flush counter that
> determines whether a log flush should be done or not, and that's
> only incremented on ITEM_PINNED objects. This is why I was
> suggesting returning that if the backing buffer is pinned, but I
> forgot all about how we already handle this case...
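
Right -- for reference, the ITEM_PINNED handling in question, as I read
the current xfsaild_push() (paraphrased, so modulo drift):

		case XFS_ITEM_PINNED:
			XFS_STATS_INC(mp, xs_push_ail_pinned);
			trace_xfs_ail_pinned(lip);

			stuck++;
			ailp->ail_log_flush++;
			break;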
> 
> > But given the above, couldn't we just remove the log force from
> > xfs_iflush() and let the existing xfsaild_push() logic handle it?
> 
> We can, but not for the reasons you are suggesting.
> 

The reason I suggested it is that we currently "detect skipped buffers
at delwri queue submit time (i.e., due to being pinned)," which causes
xfsaild to force the log. I'm not sure how you read that, but it refers
exactly to the code snippet you posted below where we bump the log
flush counter due to skipped buffers on submit (because they were
pinned).

> > It
> > should detect any pinned buffers after the delwri queue submit and do
> > the log force.
> 
> And this is what we already do with this code:
> 

Yes.

> 	if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
> 		ailp->ail_log_flush++;
> 
> Which falls down to this code when we try to flush a pinned delwri
> buffer:
> 
> 	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
> 		if (!wait_list) {
> 			if (xfs_buf_ispinned(bp)) {
> 				pinned++;
> 				continue;
> 			}
> ....
> 	return pinned;
> 

"detect skipped buffers at delwri queue submit time (i.e., due to being
pinned)"

> Basically, we leave pinned buffers on the delwri list rather than
> remove them and issue IO on them, and tell the caller that we left
> pinned/locked buffers on the list that still need to be submitted.
> The AIL takes this as an indication it needs to flush the log to
> unpin the buffers it can't flush, and because they are still on its
> delwri list, it will attempt to submit them again next time around
> after the log has been forced.
> 

Right, so this is exactly the reason I suggested.
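
And for completeness, the force itself happens at the top of the next
push cycle -- IIRC it currently looks something like this (paraphrasing
from memory, details may be off):

	/*
	 * If we encountered pinned items or did not finish writing out all
	 * buffers the last time we ran, force the log first and wait for it
	 * before pushing again.
	 */
	if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 &&
	    (!list_empty_careful(&ailp->ail_buf_list) ||
	     xfs_ail_min_lsn(ailp))) {
		ailp->ail_log_flush = 0;

		XFS_STATS_INC(mp, xs_push_ail_flush);
		xfs_log_force(mp, XFS_LOG_SYNC);
	}

So the pinned cluster buffer stays on ->ail_buf_list, the log gets
forced, and the buffer is resubmitted on the next pass.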


> So given that inode reclaim already waits for inodes to be unpinned
> and there is a xfs_buf_unpin_wait() call in
> xfs_bwrite()->xfs_buf_submit() path that will issue a log force on
> the buffer from the inode reclaim path, we don't actually need the
> log force in xfs_iflush() at all, i.e. the AIL will capture the
> buffer state and unpin inode and dquot buffers automatically, and so
> we can simply remove the pinned-buffer log forces from the
> inode/dquot flush code....
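
Agreed. If I'm reading the current tree right, that would amount to
simply deleting this hunk from xfs_iflush() (and the similar one in the
dquot flush code) -- untested sketch:

-	/*
-	 * If the buffer is pinned then push on the log now so we won't
-	 * get stuck waiting in the write for too long.
-	 */
-	if (xfs_buf_ispinned(bp))
-		xfs_log_force(mp, 0);

Brian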
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx


