Re: [PATCH 4/5] xfs: don't block in busy flushing when freeing extents

Chandan Babu R <chandan.babu@xxxxxxxxxx> · Thu, 22 Jun 2023 11:06:50 +0530

On Thu, Jun 22, 2023 at 03:28:28 PM +1000, Dave Chinner wrote:
> On Thu, Jun 22, 2023 at 09:29:46AM +0530, Chandan Babu R wrote:
>> On Wed, Jun 21, 2023 at 08:18:01 AM +1000, Dave Chinner wrote:
>> > On Tue, Jun 20, 2023 at 08:23:33PM +0530, Chandan Babu R wrote:
>> >> On Tue, Jun 20, 2023 at 10:20:20 AM +1000, Dave Chinner wrote:
>> >> > From: Dave Chinner <dchinner@xxxxxxxxxx>
>> >> >
>> >> > If the current transaction holds a busy extent and we are trying to
>> >> > allocate a new extent to fix up the free list, we can deadlock if
>> >> > the AG is entirely empty except for the busy extent held by the
>> >> > transaction.
>> > ....
>> >> > @@ -577,10 +588,23 @@ xfs_extent_busy_flush(
>> >> >  	DEFINE_WAIT		(wait);
>> >> >  	int			error;
>> >> >  
>> >> > -	error = xfs_log_force(mp, XFS_LOG_SYNC);
>> >> > +	error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
>> >> >  	if (error)
>> >> > -		return;
>> >> > +		return error;
>> >> >  
>> >> > +	/* Avoid deadlocks on uncommitted busy extents. */
>> >> > +	if (!list_empty(&tp->t_busy)) {
>> >> > +		if (alloc_flags & XFS_ALLOC_FLAG_TRYFLUSH)
>> >> > +			return 0;
>> >> > +
>> >> > +		if (busy_gen != READ_ONCE(pag->pagb_gen))
>> >> > +			return 0;
>> >> > +
>> >> > +		if (alloc_flags & XFS_ALLOC_FLAG_FREEING)
>> >> > +			return -EAGAIN;
>> >> > +	}
>> >> 
>> >> In the case where a task is freeing an ondisk inode, an ifree transaction can
>> >> invoke __xfs_inobt_free_block() twice; Once to free the inobt's leaf block and
>> >> the next call to free its immediate parent block.
>> >> 
>> >> The first call to __xfs_inobt_free_block() adds the freed extent into the
>> >> transaction's busy list and also into the per-ag rb tree tracking the busy
>> >> extent. Freeing the second inobt block could lead to the following sequence of
>> >> function calls,
>> >> 
>> >> __xfs_free_extent() => xfs_free_extent_fix_freelist() =>
>> >> xfs_alloc_fix_freelist() => xfs_alloc_ag_vextent_size()
>> >
>> > Yes, I think you might be right. I checked inode chunks - they are
>> > freed from this path via:
>> >
>> > xfs_ifree
>> >   xfs_difree
>> >     xfs_difree_inobt
>> >       xfs_difree_inobt_chunk
>> >         xfs_free_extent_later
>> > 	  <queues an XEFI for deferred freeing>
>> >
>> > And I didn't think about the inobt blocks themselves because freeing
>> > an inode can require allocation of finobt blocks and hence there's a
>> > transaction reservation for block allocation on finobt enabled
>> > filesystems. i.e. freeing can't proceed unless there is some amount
>> > of free blocks available, and that's why the finobt has an amount of
>> > per-ag space reserved for it.
>> >
>> > Hence, for finobt enabled filesystems, I don't think we can ever get
>> > down to a completely empty AG and an AGFL that needs refilling from
>> > the inode path - the metadata reserve doesn't allow the AG to be
>> > completely emptied in the way that is needed for this bug to
>> > manifest.
>> >
>> > Yes, I think it is still possible for all the free space to be busy,
>> > and so when online discard is enabled we need to do the busy wait
>> > after the log force to avoid that. However, for non-discard
>> > filesystems the sync log force is all that is needed to resolve busy
>> > extents outside the current transaction, so this wouldn't be an
>> > issue for the current patchset.
>> 
>> Are you planning to post a new version of this patchset which would solve the
>> possible cancellation of dirty transaction during freeing inobt blocks?  If
>> not, I will spend some time to review the current version of the patchset.
>
> I'm working on moving all the direct calls to xfs_free_extent to use
> xfs_free_extent_later(). It will be a totally separate preparation
> patch for the series, so everything else in the patchset should
> largely remain unchanged.
>
> I haven't finished the patch or tested it yet, but to give you an
> idea of what and why, the commit message currently reads:
>
>    xfs: use deferred frees for btree block freeing
>    
>    Btrees that aren't freespace management trees use the normal extent
>    allocation and freeing routines for their blocks. Hence when a btree
>    block is freed, a direct call to xfs_free_extent() is made and the
>    extent is immediately freed. This puts the entire free space
>    management btrees under this path, so we are stacking btrees on
>    btrees in the call stack. The inobt, finobt and refcount btrees
>    all do this.
>    
>    However, the bmap btree does not do this - it calls
>    xfs_free_extent_later() to defer the extent free operation via an
>    XEFI and hence it gets processed in deferred operation processing
>    during the commit of the primary transaction (i.e. via intent
>    chaining).
>    
>    We need to change xfs_free_extent() to behave in a non-blocking
>    manner so that we can avoid deadlocks with busy extents near ENOSPC
>    in transactions that free multiple extents. Inserting or removing a
>    record from a btree can cause a multi-level tree merge operation and
>    that will free multiple blocks from the btree in a single
>    transaction. i.e. we can call xfs_free_extent() multiple times, and
>    hence the btree manipulation transaction is vulnerable to this busy
>    extent deadlock vector.
>    
>    To fix this, convert all the remaining callers of xfs_free_extent()
>    to use xfs_free_extent_later() to queue XEFIs and hence defer
>    processing of the extent frees to a context that can be safely
>    restarted if a deadlock condition is detected.
>

Ok. In that case, I will defer review until the new version of the patchset is
posted. Thanks for working on this.

-- 
chandan