On Wed, May 17, 2023 at 11:40:05AM +1000, Dave Chinner wrote: > On Tue, May 16, 2023 at 05:59:13PM -0700, Darrick J. Wong wrote: > > Since 6.3 we got rid of the _THIS_AG indirection stuff and that becomes: > > > > xfs_alloc_fix_freelist -> > > xfs_alloc_ag_vextent_size -> > > (run all the way to the end of the bnobt) -> > > xfs_extent_busy_flush -> > > <stall on the busy extent that's in @tp->busy_list> > > > > xfs_extent_busy_flush does this, potentially while we're holding the > > freed extent in @tp->t_busy_list: > > > > error = xfs_log_force(mp, XFS_LOG_SYNC); > > if (error) > > return; > > > > do { > > prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE); > > if (busy_gen != READ_ONCE(pag->pagb_gen)) > > break; > > schedule(); > > } while (1); > > > > finish_wait(&pag->pagb_wait, &wait); > > > > The log force kicks the CIL to process whatever other committed items > > might be lurking in the log. *Hopefully* someone else freed an extent > > in the same AG, so the log force has now caused that *other* extent to > > get processed so it has now cleared the busy list. Clearing something > > from the busy list increments the busy generation (aka pagb_gen). > > *nod* > > > Unfortunately, there aren't any other extents, so the busy_gen does not > > change, and the loop runs forever. > > > > At this point, Dave writes: > > > > [15:57] <dchinner> so if we enter that function with busy extents on the > > transaction, and we are doing an extent free operation, we should return > > after the sync log force and not do the generation number wait > > > > [15:58] <dchinner> if we fail to allocate again after the sync log force > > and the generation number hasn't changed, then return -EAGAIN because no > > progress has been made. > > > > [15:59] <dchinner> Then the transaction is rolled, the transaction busy > > list is cleared, and if the next allocation attempt fails becaues > > everything is busy, we go to sleep waiting for the generation to change > > > > [16:00] <dchinner> but because the transaction does not hold any busy > > extents, it cannot deadlock here because it does not pin any extents > > that are in the busy tree.... > > > > [16:05] <dchinner> All the generation number is doing here is telling us > > whether there was busy extent resolution between the time we last > > skipped a viable extent because it was busy and when the flush > > completes. > > > > [16:06] <dchinner> it doesn't mean the next allocation will succeed, > > just that progress has been made so trying the allocation attempt will > > at least get a different result to the previous scan. > > > > I think the callsites go from this: > > > > if (busy) { > > xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); > > trace_xfs_alloc_size_busy(args); > > xfs_extent_busy_flush(args->mp, args->pag, busy_gen); > > goto restart; > > } > > I was thinking this code changes to: > > flags |= XFS_ALLOC_FLAG_TRY_FLUSH; > .... > <attempt allocation> > .... > if (busy) { > xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); > trace_xfs_alloc_size_busy(args); > error = xfs_extent_busy_flush(args->tp, args->pag, > busy_gen, flags); > if (!error) { > flags &= ~XFS_ALLOC_FLAG_TRY_FLUSH; > goto restart; > } > /* jump to cleanup exit point */ > goto out_error; > } > > Note the different first parameter - we pass args->tp, not args->mp > so that the flush has access to the transaction context... <nod> > > to something like this: > > > > bool try_log_flush = true; > > ... > > restart: > > ... > > > > if (busy) { > > bool progress; > > > > xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); > > trace_xfs_alloc_size_busy(args); > > > > /* > > * If the current transaction has an extent on the busy > > * list, we're allocating space as part of freeing > > * space, and all the free space is busy, we can't hang > > * here forever. Force the log to try to unbusy free > > * space that could have been freed by other > > * transactions, and retry the allocation. If the > > * allocation fails a second time because all the free > > * space is busy and nobody made any progress with > > * clearing busy extents, return EAGAIN so the caller > > * can roll this transaction. > > */ > > if ((flags & XFS_ALLOC_FLAG_FREEING) && > > !list_empty(&tp->t_busy_list)) { > > int log_flushed; > > > > if (try_log_flush) { > > _xfs_log_force(mp, XFS_LOG_SYNC, &log_flushed); > > try_log_flush = false; > > goto restart; > > } > > > > if (busy_gen == READ_ONCE(pag->pagb_gen)) > > return -EAGAIN; > > > > /* XXX should we set try_log_flush = true? */ > > goto restart; > > } > > > > xfs_extent_busy_flush(args->mp, args->pag, busy_gen); > > goto restart; > > } > > > > IOWs, I think Dave wants us to keep the changes in the allocator instead > > of spreading it around. > > Sort of - I want the busy extent flush code to be isolated inside > xfs_extent_busy_flush(), not spread around the allocator. :) > > xfs_extent_busy_flush( > struct xfs_trans *tp, > struct xfs_perag *pag, > unsigned int busy_gen, > unsigned int flags) > { > error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC); > if (error) > return error; > > /* > * If we are holding busy extents, the caller may not want > * to block straight away. If we are being told just to try > * a flush or progress has been made since we last skipped a busy > * extent, return immediately to allow the caller to try > * again. If we are freeing extents, we might actually be > * holding the onyly free extents in the transaction busy only > * list and the log force won't resolve that situation. In > * this case, return -EAGAIN in that case to tell the caller > * it needs to commit the busy extents it holds before > * retrying the extent free operation. > */ > if (!list_empty(&tp->t_busy_list)) { > if (flags & XFS_ALLOC_FLAG_TRY_FLUSH) > return 0; > if (busy_gen != READ_ONCE(pag->pagb_gen)) > return 0; > if (flags & XFS_ALLOC_FLAG_FREEING) > return -EAGAIN; > } Indeed, that's a lot cleaner. > > /* wait for progressing resolving busy extents */ > do { > prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE); > if (busy_gen != READ_ONCE(pag->pagb_gen)) > break; > schedule(); > } while (1); > > finish_wait(&pag->pagb_wait, &wait); > return 0; > } > > It seems cleaner to me to put this all in xfs_extent_busy_flush() > rather than having open-coded handling of extent free constraints in > each potential flush location. We already have retry semantics > around the flush, let's just extend them slightly.... <nod> Wengang, how does this sound? --D > -Dave. > > -- > Dave Chinner > david@xxxxxxxxxxxxx