On Thu, Jun 22, 2023 at 03:28:28 PM +1000, Dave Chinner wrote: > On Thu, Jun 22, 2023 at 09:29:46AM +0530, Chandan Babu R wrote: >> On Wed, Jun 21, 2023 at 08:18:01 AM +1000, Dave Chinner wrote: >> > On Tue, Jun 20, 2023 at 08:23:33PM +0530, Chandan Babu R wrote: >> >> On Tue, Jun 20, 2023 at 10:20:20 AM +1000, Dave Chinner wrote: >> >> > From: Dave Chinner <dchinner@xxxxxxxxxx> >> >> > >> >> > If the current transaction holds a busy extent and we are trying to >> >> > allocate a new extent to fix up the free list, we can deadlock if >> >> > the AG is entirely empty except for the busy extent held by the >> >> > transaction. >> > .... >> >> > @@ -577,10 +588,23 @@ xfs_extent_busy_flush( >> >> > DEFINE_WAIT (wait); >> >> > int error; >> >> > >> >> > - error = xfs_log_force(mp, XFS_LOG_SYNC); >> >> > + error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC); >> >> > if (error) >> >> > - return; >> >> > + return error; >> >> > >> >> > + /* Avoid deadlocks on uncommitted busy extents. */ >> >> > + if (!list_empty(&tp->t_busy)) { >> >> > + if (alloc_flags & XFS_ALLOC_FLAG_TRYFLUSH) >> >> > + return 0; >> >> > + >> >> > + if (busy_gen != READ_ONCE(pag->pagb_gen)) >> >> > + return 0; >> >> > + >> >> > + if (alloc_flags & XFS_ALLOC_FLAG_FREEING) >> >> > + return -EAGAIN; >> >> > + } >> >> >> >> In the case where a task is freeing an ondisk inode, an ifree transaction can >> >> invoke __xfs_inobt_free_block() twice; Once to free the inobt's leaf block and >> >> the next call to free its immediate parent block. >> >> >> >> The first call to __xfs_inobt_free_block() adds the freed extent into the >> >> transaction's busy list and also into the per-ag rb tree tracking the busy >> >> extent. Freeing the second inobt block could lead to the following sequence of >> >> function calls, >> >> >> >> __xfs_free_extent() => xfs_free_extent_fix_freelist() => >> >> xfs_alloc_fix_freelist() => xfs_alloc_ag_vextent_size() >> > >> > Yes, I think you might be right. I checked inode chunks - they are >> > freed from this path via: >> > >> > xfs_ifree >> > xfs_difree >> > xfs_difree_inobt >> > xfs_difree_inobt_chunk >> > xfs_free_extent_later >> > <queues an XEFI for deferred freeing> >> > >> > And I didn't think about the inobt blocks themselves because freeing >> > an inode can require allocation of finobt blocks and hence there's a >> > transaction reservation for block allocation on finobt enabled >> > filesystems. i.e. freeing can't proceed unless there is some amount >> > of free blocks available, and that's why the finobt has an amount of >> > per-ag space reserved for it. >> > >> > Hence, for finobt enabled filesystems, I don't think we can ever get >> > down to a completely empty AG and an AGFL that needs refilling from >> > the inode path - the metadata reserve doesn't allow the AG to be >> > completely emptied in the way that is needed for this bug to >> > manifest. >> > >> > Yes, I think it is still possible for all the free space to be busy, >> > and so when online discard is enabled we need to do the busy wait >> > after the log force to avoid that. However, for non-discard >> > filesystems the sync log force is all that is needed to resolve busy >> > extents outside the current transaction, so this wouldn't be an >> > issue for the current patchset. >> >> Are you planning to post a new version of this patchset which would solve the >> possible cancellation of dirty transaction during freeing inobt blocks? If >> not, I will spend some time to review the current version of the patchset. > > I'm working on moving all the direct calls to xfs_free_extent to use > xfs_free_extent_later(). It will be a totally separate preparation > patch for the series, so everything else in the patchset should > largely remain unchanged. > > I haven't finished the patch or tested it yet, but to give you an > idea of what and why, the commit message currently reads: > > xfs: use deferred frees for btree block freeing > > Btrees that aren't freespace management trees use the normal extent > allocation and freeing routines for their blocks. Hence when a btree > block is freed, a direct call to xfs_free_extent() is made and the > extent is immediately freed. This puts the entire free space > management btrees under this path, so we are stacking btrees on > btrees in the call stack. The inobt, finobt and refcount btrees > all do this. > > However, the bmap btree does not do this - it calls > xfs_free_extent_later() to defer the extent free operation via an > XEFI and hence it gets processed in deferred operation processing > during the commit of the primary transaction (i.e. via intent > chaining). > > We need to change xfs_free_extent() to behave in a non-blocking > manner so that we can avoid deadlocks with busy extents near ENOSPC > in transactions that free multiple extents. Inserting or removing a > record from a btree can cause a multi-level tree merge operation and > that will free multiple blocks from the btree in a single > transaction. i.e. we can call xfs_free_extent() multiple times, and > hence the btree manipulation transaction is vulnerable to this busy > extent deadlock vector. > > To fix this, convert all the remaining callers of xfs_free_extent() > to use xfs_free_extent_later() to queue XEFIs and hence defer > processing of the extent frees to a context that can be safely > restarted if a deadlock condition is detected. > Ok. In that case, I will defer review until the new version of the patchset is posted. Thanks for working on this. -- chandan