> On May 16, 2023, at 6:54 PM, Darrick J. Wong <djwong@xxxxxxxxxx> wrote: > > On Wed, May 17, 2023 at 11:40:05AM +1000, Dave Chinner wrote: >> On Tue, May 16, 2023 at 05:59:13PM -0700, Darrick J. Wong wrote: >>> Since 6.3 we got rid of the _THIS_AG indirection stuff and that becomes: >>> >>> xfs_alloc_fix_freelist -> >>> xfs_alloc_ag_vextent_size -> >>> (run all the way to the end of the bnobt) -> >>> xfs_extent_busy_flush -> >>> <stall on the busy extent that's in @tp->busy_list> >>> >>> xfs_extent_busy_flush does this, potentially while we're holding the >>> freed extent in @tp->t_busy_list: >>> >>> error = xfs_log_force(mp, XFS_LOG_SYNC); >>> if (error) >>> return; >>> >>> do { >>> prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE); >>> if (busy_gen != READ_ONCE(pag->pagb_gen)) >>> break; >>> schedule(); >>> } while (1); >>> >>> finish_wait(&pag->pagb_wait, &wait); >>> >>> The log force kicks the CIL to process whatever other committed items >>> might be lurking in the log. *Hopefully* someone else freed an extent >>> in the same AG, so the log force has now caused that *other* extent to >>> get processed so it has now cleared the busy list. Clearing something >>> from the busy list increments the busy generation (aka pagb_gen). >> >> *nod* >> >>> Unfortunately, there aren't any other extents, so the busy_gen does not >>> change, and the loop runs forever. >>> >>> At this point, Dave writes: >>> >>> [15:57] <dchinner> so if we enter that function with busy extents on the >>> transaction, and we are doing an extent free operation, we should return >>> after the sync log force and not do the generation number wait >>> >>> [15:58] <dchinner> if we fail to allocate again after the sync log force >>> and the generation number hasn't changed, then return -EAGAIN because no >>> progress has been made. >>> >>> [15:59] <dchinner> Then the transaction is rolled, the transaction busy >>> list is cleared, and if the next allocation attempt fails becaues >>> everything is busy, we go to sleep waiting for the generation to change >>> >>> [16:00] <dchinner> but because the transaction does not hold any busy >>> extents, it cannot deadlock here because it does not pin any extents >>> that are in the busy tree.... >>> >>> [16:05] <dchinner> All the generation number is doing here is telling us >>> whether there was busy extent resolution between the time we last >>> skipped a viable extent because it was busy and when the flush >>> completes. >>> >>> [16:06] <dchinner> it doesn't mean the next allocation will succeed, >>> just that progress has been made so trying the allocation attempt will >>> at least get a different result to the previous scan. >>> >>> I think the callsites go from this: >>> >>> if (busy) { >>> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); >>> trace_xfs_alloc_size_busy(args); >>> xfs_extent_busy_flush(args->mp, args->pag, busy_gen); >>> goto restart; >>> } >> >> I was thinking this code changes to: >> >> flags |= XFS_ALLOC_FLAG_TRY_FLUSH; >> .... >> <attempt allocation> >> .... >> if (busy) { >> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); >> trace_xfs_alloc_size_busy(args); >> error = xfs_extent_busy_flush(args->tp, args->pag, >> busy_gen, flags); >> if (!error) { >> flags &= ~XFS_ALLOC_FLAG_TRY_FLUSH; >> goto restart; >> } >> /* jump to cleanup exit point */ >> goto out_error; >> } >> >> Note the different first parameter - we pass args->tp, not args->mp >> so that the flush has access to the transaction context... > > <nod> > >>> to something like this: >>> >>> bool try_log_flush = true; >>> ... >>> restart: >>> ... >>> >>> if (busy) { >>> bool progress; >>> >>> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); >>> trace_xfs_alloc_size_busy(args); >>> >>> /* >>> * If the current transaction has an extent on the busy >>> * list, we're allocating space as part of freeing >>> * space, and all the free space is busy, we can't hang >>> * here forever. Force the log to try to unbusy free >>> * space that could have been freed by other >>> * transactions, and retry the allocation. If the >>> * allocation fails a second time because all the free >>> * space is busy and nobody made any progress with >>> * clearing busy extents, return EAGAIN so the caller >>> * can roll this transaction. >>> */ >>> if ((flags & XFS_ALLOC_FLAG_FREEING) && >>> !list_empty(&tp->t_busy_list)) { >>> int log_flushed; >>> >>> if (try_log_flush) { >>> _xfs_log_force(mp, XFS_LOG_SYNC, &log_flushed); >>> try_log_flush = false; >>> goto restart; >>> } >>> >>> if (busy_gen == READ_ONCE(pag->pagb_gen)) >>> return -EAGAIN; >>> >>> /* XXX should we set try_log_flush = true? */ >>> goto restart; >>> } >>> >>> xfs_extent_busy_flush(args->mp, args->pag, busy_gen); >>> goto restart; >>> } >>> >>> IOWs, I think Dave wants us to keep the changes in the allocator instead >>> of spreading it around. >> >> Sort of - I want the busy extent flush code to be isolated inside >> xfs_extent_busy_flush(), not spread around the allocator. :) >> >> xfs_extent_busy_flush( >> struct xfs_trans *tp, >> struct xfs_perag *pag, >> unsigned int busy_gen, >> unsigned int flags) >> { >> error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC); >> if (error) >> return error; >> >> /* >> * If we are holding busy extents, the caller may not want >> * to block straight away. If we are being told just to try >> * a flush or progress has been made since we last skipped a busy >> * extent, return immediately to allow the caller to try >> * again. If we are freeing extents, we might actually be >> * holding the onyly free extents in the transaction busy > > only > >> * list and the log force won't resolve that situation. In >> * this case, return -EAGAIN in that case to tell the caller >> * it needs to commit the busy extents it holds before >> * retrying the extent free operation. >> */ >> if (!list_empty(&tp->t_busy_list)) { >> if (flags & XFS_ALLOC_FLAG_TRY_FLUSH) >> return 0; >> if (busy_gen != READ_ONCE(pag->pagb_gen)) >> return 0; >> if (flags & XFS_ALLOC_FLAG_FREEING) >> return -EAGAIN; >> } > > Indeed, that's a lot cleaner. > >> >> /* wait for progressing resolving busy extents */ >> do { >> prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE); >> if (busy_gen != READ_ONCE(pag->pagb_gen)) >> break; >> schedule(); >> } while (1); >> >> finish_wait(&pag->pagb_wait, &wait); >> return 0; >> } >> >> It seems cleaner to me to put this all in xfs_extent_busy_flush() >> rather than having open-coded handling of extent free constraints in >> each potential flush location. We already have retry semantics >> around the flush, let's just extend them slightly.... > > <nod> Wengang, how does this sound? Thanks Darrick, pls see my previous reply. thanks, wengang