On Tue, May 16, 2023 at 05:59:13PM -0700, Darrick J. Wong wrote: > Since 6.3 we got rid of the _THIS_AG indirection stuff and that becomes: > > xfs_alloc_fix_freelist -> > xfs_alloc_ag_vextent_size -> > (run all the way to the end of the bnobt) -> > xfs_extent_busy_flush -> > <stall on the busy extent that's in @tp->busy_list> > > xfs_extent_busy_flush does this, potentially while we're holding the > freed extent in @tp->t_busy_list: > > error = xfs_log_force(mp, XFS_LOG_SYNC); > if (error) > return; > > do { > prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE); > if (busy_gen != READ_ONCE(pag->pagb_gen)) > break; > schedule(); > } while (1); > > finish_wait(&pag->pagb_wait, &wait); > > The log force kicks the CIL to process whatever other committed items > might be lurking in the log. *Hopefully* someone else freed an extent > in the same AG, so the log force has now caused that *other* extent to > get processed so it has now cleared the busy list. Clearing something > from the busy list increments the busy generation (aka pagb_gen). *nod* > Unfortunately, there aren't any other extents, so the busy_gen does not > change, and the loop runs forever. > > At this point, Dave writes: > > [15:57] <dchinner> so if we enter that function with busy extents on the > transaction, and we are doing an extent free operation, we should return > after the sync log force and not do the generation number wait > > [15:58] <dchinner> if we fail to allocate again after the sync log force > and the generation number hasn't changed, then return -EAGAIN because no > progress has been made. > > [15:59] <dchinner> Then the transaction is rolled, the transaction busy > list is cleared, and if the next allocation attempt fails becaues > everything is busy, we go to sleep waiting for the generation to change > > [16:00] <dchinner> but because the transaction does not hold any busy > extents, it cannot deadlock here because it does not pin any extents > that are in the busy tree.... > > [16:05] <dchinner> All the generation number is doing here is telling us > whether there was busy extent resolution between the time we last > skipped a viable extent because it was busy and when the flush > completes. > > [16:06] <dchinner> it doesn't mean the next allocation will succeed, > just that progress has been made so trying the allocation attempt will > at least get a different result to the previous scan. > > I think the callsites go from this: > > if (busy) { > xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); > trace_xfs_alloc_size_busy(args); > xfs_extent_busy_flush(args->mp, args->pag, busy_gen); > goto restart; > } I was thinking this code changes to: flags |= XFS_ALLOC_FLAG_TRY_FLUSH; .... <attempt allocation> .... if (busy) { xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); trace_xfs_alloc_size_busy(args); error = xfs_extent_busy_flush(args->tp, args->pag, busy_gen, flags); if (!error) { flags &= ~XFS_ALLOC_FLAG_TRY_FLUSH; goto restart; } /* jump to cleanup exit point */ goto out_error; } Note the different first parameter - we pass args->tp, not args->mp so that the flush has access to the transaction context... > to something like this: > > bool try_log_flush = true; > ... > restart: > ... > > if (busy) { > bool progress; > > xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); > trace_xfs_alloc_size_busy(args); > > /* > * If the current transaction has an extent on the busy > * list, we're allocating space as part of freeing > * space, and all the free space is busy, we can't hang > * here forever. Force the log to try to unbusy free > * space that could have been freed by other > * transactions, and retry the allocation. If the > * allocation fails a second time because all the free > * space is busy and nobody made any progress with > * clearing busy extents, return EAGAIN so the caller > * can roll this transaction. > */ > if ((flags & XFS_ALLOC_FLAG_FREEING) && > !list_empty(&tp->t_busy_list)) { > int log_flushed; > > if (try_log_flush) { > _xfs_log_force(mp, XFS_LOG_SYNC, &log_flushed); > try_log_flush = false; > goto restart; > } > > if (busy_gen == READ_ONCE(pag->pagb_gen)) > return -EAGAIN; > > /* XXX should we set try_log_flush = true? */ > goto restart; > } > > xfs_extent_busy_flush(args->mp, args->pag, busy_gen); > goto restart; > } > > IOWs, I think Dave wants us to keep the changes in the allocator instead > of spreading it around. Sort of - I want the busy extent flush code to be isolated inside xfs_extent_busy_flush(), not spread around the allocator. :) xfs_extent_busy_flush( struct xfs_trans *tp, struct xfs_perag *pag, unsigned int busy_gen, unsigned int flags) { error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC); if (error) return error; /* * If we are holding busy extents, the caller may not want * to block straight away. If we are being told just to try * a flush or progress has been made since we last skipped a busy * extent, return immediately to allow the caller to try * again. If we are freeing extents, we might actually be * holding the onyly free extents in the transaction busy * list and the log force won't resolve that situation. In * this case, return -EAGAIN in that case to tell the caller * it needs to commit the busy extents it holds before * retrying the extent free operation. */ if (!list_empty(&tp->t_busy_list)) { if (flags & XFS_ALLOC_FLAG_TRY_FLUSH) return 0; if (busy_gen != READ_ONCE(pag->pagb_gen)) return 0; if (flags & XFS_ALLOC_FLAG_FREEING) return -EAGAIN; } /* wait for progressing resolving busy extents */ do { prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE); if (busy_gen != READ_ONCE(pag->pagb_gen)) break; schedule(); } while (1); finish_wait(&pag->pagb_wait, &wait); return 0; } It seems cleaner to me to put this all in xfs_extent_busy_flush() rather than having open-coded handling of extent free constraints in each potential flush location. We already have retry semantics around the flush, let's just extend them slightly.... -Dave. -- Dave Chinner david@xxxxxxxxxxxxx