Re: [PATCH] xfs: avoid freeing multiple extents from same AG in pending transactions

Wengang Wang <wen.gang.wang@xxxxxxxxxx> · Wed, 17 May 2023 19:14:32 +0000

> On May 16, 2023, at 6:40 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> On Tue, May 16, 2023 at 05:59:13PM -0700, Darrick J. Wong wrote:
>> Since 6.3 we got rid of the _THIS_AG indirection stuff and that becomes:
>> 
>> xfs_alloc_fix_freelist ->
>> xfs_alloc_ag_vextent_size ->
>> (run all the way to the end of the bnobt) ->
>> xfs_extent_busy_flush ->
>> <stall on the busy extent that's in @tp->busy_list>
>> 
>> xfs_extent_busy_flush does this, potentially while we're holding the
>> freed extent in @tp->t_busy_list:
>> 
>> error = xfs_log_force(mp, XFS_LOG_SYNC);
>> if (error)
>> return;
>> 
>> do {
>> prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
>> if  (busy_gen != READ_ONCE(pag->pagb_gen))
>> break;
>> schedule();
>> } while (1);
>> 
>> finish_wait(&pag->pagb_wait, &wait);
>> 
>> The log force kicks the CIL to process whatever other committed items
>> might be lurking in the log.  *Hopefully* someone else freed an extent
>> in the same AG, so the log force has now caused that *other* extent to
>> get processed so it has now cleared the busy list.  Clearing something
>> from the busy list increments the busy generation (aka pagb_gen).
> 
> *nod*
> 
>> Unfortunately, there aren't any other extents, so the busy_gen does not
>> change, and the loop runs forever.
>> 
>> At this point, Dave writes:
>> 
>> [15:57] <dchinner> so if we enter that function with busy extents on the
>> transaction, and we are doing an extent free operation, we should return
>> after the sync log force and not do the generation number wait
>> 
>> [15:58] <dchinner> if we fail to allocate again after the sync log force
>> and the generation number hasn't changed, then return -EAGAIN because no
>> progress has been made.
>> 
>> [15:59] <dchinner> Then the transaction is rolled, the transaction busy
>> list is cleared, and if the next allocation attempt fails becaues
>> everything is busy, we go to sleep waiting for the generation to change
>> 
>> [16:00] <dchinner> but because the transaction does not hold any busy
>> extents, it cannot deadlock here because it does not pin any extents
>> that are in the busy tree....
>> 
>> [16:05] <dchinner> All the generation number is doing here is telling us
>> whether there was busy extent resolution between the time we last
>> skipped a viable extent because it was busy and when the flush
>> completes.
>> 
>> [16:06] <dchinner> it doesn't mean the next allocation will succeed,
>> just that progress has been made so trying the allocation attempt will
>> at least get a different result to the previous scan.
>> 
>> I think the callsites go from this:
>> 
>> if (busy) {
>> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>> trace_xfs_alloc_size_busy(args);
>> xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
>> goto restart;
>> }
> 

Basically I think the following also work by adding complication to allocator
(I didn’t want to do so to leave allocator as simple as possible).

> I was thinking this code changes to:
> 
> flags |= XFS_ALLOC_FLAG_TRY_FLUSH;
> ....
> <attempt allocation>
> ....
> if (busy) {
> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
> trace_xfs_alloc_size_busy(args);
> error = xfs_extent_busy_flush(args->tp, args->pag,
> busy_gen, flags);
> if (!error) {
> flags &= ~XFS_ALLOC_FLAG_TRY_FLUSH;

What’s the benefits to use XFS_ALLOC_FLAG_TRY_FLUSH?
If no change happened to pagb_gen, we would get nothing good in the retry
but waste cycles. Or I missed something?

> goto restart;
> }
> /* jump to cleanup exit point */
> goto out_error;
> }
> 
> Note the different first parameter - we pass args->tp, not args->mp
> so that the flush has access to the transaction context...
> 
>> to something like this:
>> 
>> bool try_log_flush = true;
>> ...
>> restart:
>> ...
>> 
>> if (busy) {
>> bool progress;
>> 
>> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>> trace_xfs_alloc_size_busy(args);
>> 
>> /*
>> * If the current transaction has an extent on the busy
>> * list, we're allocating space as part of freeing
>> * space, and all the free space is busy, we can't hang
>> * here forever.  Force the log to try to unbusy free
>> * space that could have been freed by other
>> * transactions, and retry the allocation.  If the
>> * allocation fails a second time because all the free
>> * space is busy and nobody made any progress with
>> * clearing busy extents, return EAGAIN so the caller
>> * can roll this transaction.
>> */
>> if ((flags & XFS_ALLOC_FLAG_FREEING) &&
>>    !list_empty(&tp->t_busy_list)) {
>> int log_flushed;
>> 
>> if (try_log_flush) {
>> _xfs_log_force(mp, XFS_LOG_SYNC, &log_flushed);
>> try_log_flush = false;
>> goto restart;
>> }
>> 
>> if (busy_gen == READ_ONCE(pag->pagb_gen))
>> return -EAGAIN;
>> 
>> /* XXX should we set try_log_flush = true? */
>> goto restart;
>> }
>> 
>> xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
>> goto restart;
>> }
>> 
>> IOWs, I think Dave wants us to keep the changes in the allocator instead
>> of spreading it around.
> 
> Sort of - I want the busy extent flush code to be isolated inside
> xfs_extent_busy_flush(), not spread around the allocator. :)
> 
> xfs_extent_busy_flush(
> struct xfs_trans *tp,
> struct xfs_perag *pag,
> unsigned int busy_gen,
> unsigned int flags)
> {
> error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
> if (error)
> return error;
> 
> /*
> * If we are holding busy extents, the caller may not want
> * to block straight away. If we are being told just to try
> * a flush or progress has been made since we last skipped a busy
> * extent, return immediately to allow the caller to try
> * again. If we are freeing extents, we might actually be
> * holding the onyly free extents in the transaction busy
> * list and the log force won't resolve that situation. In
> * this case, return -EAGAIN in that case to tell the caller
> * it needs to commit the busy extents it holds before
> * retrying the extent free operation.
> */
> if (!list_empty(&tp->t_busy_list)) {
> if (flags & XFS_ALLOC_FLAG_TRY_FLUSH)
> return 0;

I don’t think above is needed. see my previous comment.

thanks,
wengang
> if (busy_gen != READ_ONCE(pag->pagb_gen))
> return 0;
> if (flags & XFS_ALLOC_FLAG_FREEING)
> return -EAGAIN;
> }
> 
> /* wait for progressing resolving busy extents */
> do {
> prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
> if  (busy_gen != READ_ONCE(pag->pagb_gen))
> break;
> schedule();
> } while (1);
> 
> finish_wait(&pag->pagb_wait, &wait);
> return 0;
> }
> 
> It seems cleaner to me to put this all in xfs_extent_busy_flush()
> rather than having open-coded handling of extent free constraints in
> each potential flush location. We already have retry semantics
> around the flush, let's just extend them slightly....
> 
> -Dave.
> 
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx