On Wed, Jun 21, 2023 at 08:18:01 AM +1000, Dave Chinner wrote: > On Tue, Jun 20, 2023 at 08:23:33PM +0530, Chandan Babu R wrote: >> On Tue, Jun 20, 2023 at 10:20:20 AM +1000, Dave Chinner wrote: >> > From: Dave Chinner <dchinner@xxxxxxxxxx> >> > >> > If the current transaction holds a busy extent and we are trying to >> > allocate a new extent to fix up the free list, we can deadlock if >> > the AG is entirely empty except for the busy extent held by the >> > transaction. > .... >> > @@ -577,10 +588,23 @@ xfs_extent_busy_flush( >> > DEFINE_WAIT (wait); >> > int error; >> > >> > - error = xfs_log_force(mp, XFS_LOG_SYNC); >> > + error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC); >> > if (error) >> > - return; >> > + return error; >> > >> > + /* Avoid deadlocks on uncommitted busy extents. */ >> > + if (!list_empty(&tp->t_busy)) { >> > + if (alloc_flags & XFS_ALLOC_FLAG_TRYFLUSH) >> > + return 0; >> > + >> > + if (busy_gen != READ_ONCE(pag->pagb_gen)) >> > + return 0; >> > + >> > + if (alloc_flags & XFS_ALLOC_FLAG_FREEING) >> > + return -EAGAIN; >> > + } >> >> In the case where a task is freeing an ondisk inode, an ifree transaction can >> invoke __xfs_inobt_free_block() twice; Once to free the inobt's leaf block and >> the next call to free its immediate parent block. >> >> The first call to __xfs_inobt_free_block() adds the freed extent into the >> transaction's busy list and also into the per-ag rb tree tracking the busy >> extent. Freeing the second inobt block could lead to the following sequence of >> function calls, >> >> __xfs_free_extent() => xfs_free_extent_fix_freelist() => >> xfs_alloc_fix_freelist() => xfs_alloc_ag_vextent_size() > > Yes, I think you might be right. I checked inode chunks - they are > freed from this path via: > > xfs_ifree > xfs_difree > xfs_difree_inobt > xfs_difree_inobt_chunk > xfs_free_extent_later > <queues an XEFI for deferred freeing> > > And I didn't think about the inobt blocks themselves because freeing > an inode can require allocation of finobt blocks and hence there's a > transaction reservation for block allocation on finobt enabled > filesystems. i.e. freeing can't proceed unless there is some amount > of free blocks available, and that's why the finobt has an amount of > per-ag space reserved for it. > > Hence, for finobt enabled filesystems, I don't think we can ever get > down to a completely empty AG and an AGFL that needs refilling from > the inode path - the metadata reserve doesn't allow the AG to be > completely emptied in the way that is needed for this bug to > manifest. > > Yes, I think it is still possible for all the free space to be busy, > and so when online discard is enabled we need to do the busy wait > after the log force to avoid that. However, for non-discard > filesystems the sync log force is all that is needed to resolve busy > extents outside the current transaction, so this wouldn't be an > issue for the current patchset. Are you planning to post a new version of this patchset which would solve the possible cancellation of dirty transaction during freeing inobt blocks? If not, I will spend some time to review the current version of the patchset. > > I suspect that is why I haven't seen issues on v5 filesystems, > though I also haven't seen issues on v4 filesystems that don't have > the finobt per-ag metadata reservation nor the space reservation at > transaction reservation time. I know that the fstests enospc group > is exercising the busy flush code, but I doubt that it was exercised > through the inode btree block freeing path... > > I note that the refcount btree block freeing path also call > xfs_free_extent(). This might be OK, because refcount btree updates > get called from deferred intent processing, and hence the EAGAIN > will trigger a transaction roll and retry correctly. > > I suspect, however, that both of these paths should simply call > xfs_free_extent_later() to queue an XEFI for deferred processing, > and that takes the entire extent freeing path out from under the > btree operations. > > I'll look into that. Thanks! -- chandan