> On May 16, 2023, at 6:40 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > On Tue, May 16, 2023 at 05:59:13PM -0700, Darrick J. Wong wrote: >> Since 6.3 we got rid of the _THIS_AG indirection stuff and that becomes: >> >> xfs_alloc_fix_freelist -> >> xfs_alloc_ag_vextent_size -> >> (run all the way to the end of the bnobt) -> >> xfs_extent_busy_flush -> >> <stall on the busy extent that's in @tp->busy_list> >> >> xfs_extent_busy_flush does this, potentially while we're holding the >> freed extent in @tp->t_busy_list: >> >> error = xfs_log_force(mp, XFS_LOG_SYNC); >> if (error) >> return; >> >> do { >> prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE); >> if (busy_gen != READ_ONCE(pag->pagb_gen)) >> break; >> schedule(); >> } while (1); >> >> finish_wait(&pag->pagb_wait, &wait); >> >> The log force kicks the CIL to process whatever other committed items >> might be lurking in the log. *Hopefully* someone else freed an extent >> in the same AG, so the log force has now caused that *other* extent to >> get processed so it has now cleared the busy list. Clearing something >> from the busy list increments the busy generation (aka pagb_gen). > > *nod* > >> Unfortunately, there aren't any other extents, so the busy_gen does not >> change, and the loop runs forever. >> >> At this point, Dave writes: >> >> [15:57] <dchinner> so if we enter that function with busy extents on the >> transaction, and we are doing an extent free operation, we should return >> after the sync log force and not do the generation number wait >> >> [15:58] <dchinner> if we fail to allocate again after the sync log force >> and the generation number hasn't changed, then return -EAGAIN because no >> progress has been made. >> >> [15:59] <dchinner> Then the transaction is rolled, the transaction busy >> list is cleared, and if the next allocation attempt fails becaues >> everything is busy, we go to sleep waiting for the generation to change >> >> [16:00] <dchinner> but because the transaction does not hold any busy >> extents, it cannot deadlock here because it does not pin any extents >> that are in the busy tree.... >> >> [16:05] <dchinner> All the generation number is doing here is telling us >> whether there was busy extent resolution between the time we last >> skipped a viable extent because it was busy and when the flush >> completes. >> >> [16:06] <dchinner> it doesn't mean the next allocation will succeed, >> just that progress has been made so trying the allocation attempt will >> at least get a different result to the previous scan. >> >> I think the callsites go from this: >> >> if (busy) { >> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); >> trace_xfs_alloc_size_busy(args); >> xfs_extent_busy_flush(args->mp, args->pag, busy_gen); >> goto restart; >> } > Basically I think the following also work by adding complication to allocator (I didn’t want to do so to leave allocator as simple as possible). > I was thinking this code changes to: > > flags |= XFS_ALLOC_FLAG_TRY_FLUSH; > .... > <attempt allocation> > .... > if (busy) { > xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); > trace_xfs_alloc_size_busy(args); > error = xfs_extent_busy_flush(args->tp, args->pag, > busy_gen, flags); > if (!error) { > flags &= ~XFS_ALLOC_FLAG_TRY_FLUSH; What’s the benefits to use XFS_ALLOC_FLAG_TRY_FLUSH? If no change happened to pagb_gen, we would get nothing good in the retry but waste cycles. Or I missed something? > goto restart; > } > /* jump to cleanup exit point */ > goto out_error; > } > > Note the different first parameter - we pass args->tp, not args->mp > so that the flush has access to the transaction context... > >> to something like this: >> >> bool try_log_flush = true; >> ... >> restart: >> ... >> >> if (busy) { >> bool progress; >> >> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); >> trace_xfs_alloc_size_busy(args); >> >> /* >> * If the current transaction has an extent on the busy >> * list, we're allocating space as part of freeing >> * space, and all the free space is busy, we can't hang >> * here forever. Force the log to try to unbusy free >> * space that could have been freed by other >> * transactions, and retry the allocation. If the >> * allocation fails a second time because all the free >> * space is busy and nobody made any progress with >> * clearing busy extents, return EAGAIN so the caller >> * can roll this transaction. >> */ >> if ((flags & XFS_ALLOC_FLAG_FREEING) && >> !list_empty(&tp->t_busy_list)) { >> int log_flushed; >> >> if (try_log_flush) { >> _xfs_log_force(mp, XFS_LOG_SYNC, &log_flushed); >> try_log_flush = false; >> goto restart; >> } >> >> if (busy_gen == READ_ONCE(pag->pagb_gen)) >> return -EAGAIN; >> >> /* XXX should we set try_log_flush = true? */ >> goto restart; >> } >> >> xfs_extent_busy_flush(args->mp, args->pag, busy_gen); >> goto restart; >> } >> >> IOWs, I think Dave wants us to keep the changes in the allocator instead >> of spreading it around. > > Sort of - I want the busy extent flush code to be isolated inside > xfs_extent_busy_flush(), not spread around the allocator. :) > > xfs_extent_busy_flush( > struct xfs_trans *tp, > struct xfs_perag *pag, > unsigned int busy_gen, > unsigned int flags) > { > error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC); > if (error) > return error; > > /* > * If we are holding busy extents, the caller may not want > * to block straight away. If we are being told just to try > * a flush or progress has been made since we last skipped a busy > * extent, return immediately to allow the caller to try > * again. If we are freeing extents, we might actually be > * holding the onyly free extents in the transaction busy > * list and the log force won't resolve that situation. In > * this case, return -EAGAIN in that case to tell the caller > * it needs to commit the busy extents it holds before > * retrying the extent free operation. > */ > if (!list_empty(&tp->t_busy_list)) { > if (flags & XFS_ALLOC_FLAG_TRY_FLUSH) > return 0; I don’t think above is needed. see my previous comment. thanks, wengang > if (busy_gen != READ_ONCE(pag->pagb_gen)) > return 0; > if (flags & XFS_ALLOC_FLAG_FREEING) > return -EAGAIN; > } > > /* wait for progressing resolving busy extents */ > do { > prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE); > if (busy_gen != READ_ONCE(pag->pagb_gen)) > break; > schedule(); > } while (1); > > finish_wait(&pag->pagb_wait, &wait); > return 0; > } > > It seems cleaner to me to put this all in xfs_extent_busy_flush() > rather than having open-coded handling of extent free constraints in > each potential flush location. We already have retry semantics > around the flush, let's just extend them slightly.... > > -Dave. > > -- > Dave Chinner > david@xxxxxxxxxxxxx