Re: [PATCH] xfs: avoid freeing multiple extents from same AG in pending transactions

Wengang Wang <wen.gang.wang@xxxxxxxxxx> · Wed, 17 May 2023 19:41:31 +0000

> On May 16, 2023, at 6:54 PM, Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
> 
> On Wed, May 17, 2023 at 11:40:05AM +1000, Dave Chinner wrote:
>> On Tue, May 16, 2023 at 05:59:13PM -0700, Darrick J. Wong wrote:
>>> Since 6.3 we got rid of the _THIS_AG indirection stuff and that becomes:
>>> 
>>> xfs_alloc_fix_freelist ->
>>> xfs_alloc_ag_vextent_size ->
>>> (run all the way to the end of the bnobt) ->
>>> xfs_extent_busy_flush ->
>>> <stall on the busy extent that's in @tp->busy_list>
>>> 
>>> xfs_extent_busy_flush does this, potentially while we're holding the
>>> freed extent in @tp->t_busy_list:
>>> 
>>> error = xfs_log_force(mp, XFS_LOG_SYNC);
>>> if (error)
>>> return;
>>> 
>>> do {
>>> prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
>>> if  (busy_gen != READ_ONCE(pag->pagb_gen))
>>> break;
>>> schedule();
>>> } while (1);
>>> 
>>> finish_wait(&pag->pagb_wait, &wait);
>>> 
>>> The log force kicks the CIL to process whatever other committed items
>>> might be lurking in the log.  *Hopefully* someone else freed an extent
>>> in the same AG, so the log force has now caused that *other* extent to
>>> get processed so it has now cleared the busy list.  Clearing something
>>> from the busy list increments the busy generation (aka pagb_gen).
>> 
>> *nod*
>> 
>>> Unfortunately, there aren't any other extents, so the busy_gen does not
>>> change, and the loop runs forever.
>>> 
>>> At this point, Dave writes:
>>> 
>>> [15:57] <dchinner> so if we enter that function with busy extents on the
>>> transaction, and we are doing an extent free operation, we should return
>>> after the sync log force and not do the generation number wait
>>> 
>>> [15:58] <dchinner> if we fail to allocate again after the sync log force
>>> and the generation number hasn't changed, then return -EAGAIN because no
>>> progress has been made.
>>> 
>>> [15:59] <dchinner> Then the transaction is rolled, the transaction busy
>>> list is cleared, and if the next allocation attempt fails becaues
>>> everything is busy, we go to sleep waiting for the generation to change
>>> 
>>> [16:00] <dchinner> but because the transaction does not hold any busy
>>> extents, it cannot deadlock here because it does not pin any extents
>>> that are in the busy tree....
>>> 
>>> [16:05] <dchinner> All the generation number is doing here is telling us
>>> whether there was busy extent resolution between the time we last
>>> skipped a viable extent because it was busy and when the flush
>>> completes.
>>> 
>>> [16:06] <dchinner> it doesn't mean the next allocation will succeed,
>>> just that progress has been made so trying the allocation attempt will
>>> at least get a different result to the previous scan.
>>> 
>>> I think the callsites go from this:
>>> 
>>> if (busy) {
>>> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>>> trace_xfs_alloc_size_busy(args);
>>> xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
>>> goto restart;
>>> }
>> 
>> I was thinking this code changes to:
>> 
>> flags |= XFS_ALLOC_FLAG_TRY_FLUSH;
>> ....
>> <attempt allocation>
>> ....
>> if (busy) {
>> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>> trace_xfs_alloc_size_busy(args);
>> error = xfs_extent_busy_flush(args->tp, args->pag,
>> busy_gen, flags);
>> if (!error) {
>> flags &= ~XFS_ALLOC_FLAG_TRY_FLUSH;
>> goto restart;
>> }
>> /* jump to cleanup exit point */
>> goto out_error;
>> }
>> 
>> Note the different first parameter - we pass args->tp, not args->mp
>> so that the flush has access to the transaction context...
> 
> <nod>
> 
>>> to something like this:
>>> 
>>> bool try_log_flush = true;
>>> ...
>>> restart:
>>> ...
>>> 
>>> if (busy) {
>>> bool progress;
>>> 
>>> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>>> trace_xfs_alloc_size_busy(args);
>>> 
>>> /*
>>>  * If the current transaction has an extent on the busy
>>>  * list, we're allocating space as part of freeing
>>>  * space, and all the free space is busy, we can't hang
>>>  * here forever.  Force the log to try to unbusy free
>>>  * space that could have been freed by other
>>>  * transactions, and retry the allocation.  If the
>>>  * allocation fails a second time because all the free
>>>  * space is busy and nobody made any progress with
>>>  * clearing busy extents, return EAGAIN so the caller
>>>  * can roll this transaction.
>>>  */
>>> if ((flags & XFS_ALLOC_FLAG_FREEING) &&
>>>     !list_empty(&tp->t_busy_list)) {
>>> int log_flushed;
>>> 
>>> if (try_log_flush) {
>>> _xfs_log_force(mp, XFS_LOG_SYNC, &log_flushed);
>>> try_log_flush = false;
>>> goto restart;
>>> }
>>> 
>>> if (busy_gen == READ_ONCE(pag->pagb_gen))
>>> return -EAGAIN;
>>> 
>>> /* XXX should we set try_log_flush = true? */
>>> goto restart;
>>> }
>>> 
>>> xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
>>> goto restart;
>>> }
>>> 
>>> IOWs, I think Dave wants us to keep the changes in the allocator instead
>>> of spreading it around.
>> 
>> Sort of - I want the busy extent flush code to be isolated inside
>> xfs_extent_busy_flush(), not spread around the allocator. :)
>> 
>> xfs_extent_busy_flush(
>> struct xfs_trans *tp,
>> struct xfs_perag *pag,
>> unsigned int busy_gen,
>> unsigned int flags)
>> {
>> error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
>> if (error)
>> return error;
>> 
>> /*
>>  * If we are holding busy extents, the caller may not want
>>  * to block straight away. If we are being told just to try
>>  * a flush or progress has been made since we last skipped a busy
>>  * extent, return immediately to allow the caller to try
>>  * again. If we are freeing extents, we might actually be
>>  * holding the onyly free extents in the transaction busy
> 
>                       only
> 
>>  * list and the log force won't resolve that situation. In
>>  * this case, return -EAGAIN in that case to tell the caller
>>  * it needs to commit the busy extents it holds before
>>  * retrying the extent free operation.
>>  */
>> if (!list_empty(&tp->t_busy_list)) {
>> if (flags & XFS_ALLOC_FLAG_TRY_FLUSH)
>> return 0;
>> if (busy_gen != READ_ONCE(pag->pagb_gen))
>> return 0;
>> if (flags & XFS_ALLOC_FLAG_FREEING)
>> return -EAGAIN;
>> }
> 
> Indeed, that's a lot cleaner.
> 
>> 
>> /* wait for progressing resolving busy extents */
>> do {
>> prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
>> if  (busy_gen != READ_ONCE(pag->pagb_gen))
>> break;
>> schedule();
>> } while (1);
>> 
>> finish_wait(&pag->pagb_wait, &wait);
>> return 0;
>> }
>> 
>> It seems cleaner to me to put this all in xfs_extent_busy_flush()
>> rather than having open-coded handling of extent free constraints in
>> each potential flush location. We already have retry semantics
>> around the flush, let's just extend them slightly....
> 
> <nod> Wengang, how does this sound?

Thanks Darrick, pls see my previous reply.
thanks,
wengang