Re: [PATCH 4/5] xfs: don't block in busy flushing when freeing extents

Chandan Babu R <chandan.babu@xxxxxxxxxx> · Thu, 22 Jun 2023 09:29:46 +0530

On Wed, Jun 21, 2023 at 08:18:01 AM +1000, Dave Chinner wrote:
> On Tue, Jun 20, 2023 at 08:23:33PM +0530, Chandan Babu R wrote:
>> On Tue, Jun 20, 2023 at 10:20:20 AM +1000, Dave Chinner wrote:
>> > From: Dave Chinner <dchinner@xxxxxxxxxx>
>> >
>> > If the current transaction holds a busy extent and we are trying to
>> > allocate a new extent to fix up the free list, we can deadlock if
>> > the AG is entirely empty except for the busy extent held by the
>> > transaction.
> ....
>> > @@ -577,10 +588,23 @@ xfs_extent_busy_flush(
>> >  	DEFINE_WAIT		(wait);
>> >  	int			error;
>> >  
>> > -	error = xfs_log_force(mp, XFS_LOG_SYNC);
>> > +	error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
>> >  	if (error)
>> > -		return;
>> > +		return error;
>> >  
>> > +	/* Avoid deadlocks on uncommitted busy extents. */
>> > +	if (!list_empty(&tp->t_busy)) {
>> > +		if (alloc_flags & XFS_ALLOC_FLAG_TRYFLUSH)
>> > +			return 0;
>> > +
>> > +		if (busy_gen != READ_ONCE(pag->pagb_gen))
>> > +			return 0;
>> > +
>> > +		if (alloc_flags & XFS_ALLOC_FLAG_FREEING)
>> > +			return -EAGAIN;
>> > +	}
>> 
>> In the case where a task is freeing an ondisk inode, an ifree transaction can
>> invoke __xfs_inobt_free_block() twice; Once to free the inobt's leaf block and
>> the next call to free its immediate parent block.
>> 
>> The first call to __xfs_inobt_free_block() adds the freed extent into the
>> transaction's busy list and also into the per-ag rb tree tracking the busy
>> extent. Freeing the second inobt block could lead to the following sequence of
>> function calls,
>> 
>> __xfs_free_extent() => xfs_free_extent_fix_freelist() =>
>> xfs_alloc_fix_freelist() => xfs_alloc_ag_vextent_size()
>
> Yes, I think you might be right. I checked inode chunks - they are
> freed from this path via:
>
> xfs_ifree
>   xfs_difree
>     xfs_difree_inobt
>       xfs_difree_inobt_chunk
>         xfs_free_extent_later
> 	  <queues an XEFI for deferred freeing>
>
> And I didn't think about the inobt blocks themselves because freeing
> an inode can require allocation of finobt blocks and hence there's a
> transaction reservation for block allocation on finobt enabled
> filesystems. i.e. freeing can't proceed unless there is some amount
> of free blocks available, and that's why the finobt has an amount of
> per-ag space reserved for it.
>
> Hence, for finobt enabled filesystems, I don't think we can ever get
> down to a completely empty AG and an AGFL that needs refilling from
> the inode path - the metadata reserve doesn't allow the AG to be
> completely emptied in the way that is needed for this bug to
> manifest.
>
> Yes, I think it is still possible for all the free space to be busy,
> and so when online discard is enabled we need to do the busy wait
> after the log force to avoid that. However, for non-discard
> filesystems the sync log force is all that is needed to resolve busy
> extents outside the current transaction, so this wouldn't be an
> issue for the current patchset.

Are you planning to post a new version of this patchset which would solve the
possible cancellation of dirty transaction during freeing inobt blocks?  If
not, I will spend some time to review the current version of the patchset.

>
> I suspect that is why I haven't seen issues on v5 filesystems,
> though I also haven't seen issues on v4 filesystems that don't have
> the finobt per-ag metadata reservation nor the space reservation at
> transaction reservation time. I know that the fstests enospc group
> is exercising the busy flush code, but I doubt that it was exercised
> through the inode btree block freeing path...
>
> I note that the refcount btree block freeing path also call
> xfs_free_extent(). This might be OK, because refcount btree updates
> get called from deferred intent processing, and hence the EAGAIN
> will trigger a transaction roll and retry correctly.
>
> I suspect, however, that both of these paths should simply call
> xfs_free_extent_later() to queue an XEFI for deferred processing,
> and that takes the entire extent freeing path out from under the
> btree operations. 
>
> I'll look into that. Thanks!

-- 
chandan