Re: [PATCH] xfs: avoid freeing multiple extents from same AG in pending transactions

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 17 May 2023 11:40:05 +1000

On Tue, May 16, 2023 at 05:59:13PM -0700, Darrick J. Wong wrote:
> Since 6.3 we got rid of the _THIS_AG indirection stuff and that becomes:
> 
> xfs_alloc_fix_freelist ->
> xfs_alloc_ag_vextent_size ->
> (run all the way to the end of the bnobt) ->
> xfs_extent_busy_flush ->
> <stall on the busy extent that's in @tp->busy_list>
> 
> xfs_extent_busy_flush does this, potentially while we're holding the
> freed extent in @tp->t_busy_list:
> 
> 	error = xfs_log_force(mp, XFS_LOG_SYNC);
> 	if (error)
> 		return;
> 
> 	do {
> 		prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
> 		if  (busy_gen != READ_ONCE(pag->pagb_gen))
> 			break;
> 		schedule();
> 	} while (1);
> 
> 	finish_wait(&pag->pagb_wait, &wait);
> 
> The log force kicks the CIL to process whatever other committed items
> might be lurking in the log.  *Hopefully* someone else freed an extent
> in the same AG, so the log force has now caused that *other* extent to
> get processed so it has now cleared the busy list.  Clearing something
> from the busy list increments the busy generation (aka pagb_gen).

*nod*

> Unfortunately, there aren't any other extents, so the busy_gen does not
> change, and the loop runs forever.
> 
> At this point, Dave writes:
> 
> [15:57] <dchinner> so if we enter that function with busy extents on the
> transaction, and we are doing an extent free operation, we should return
> after the sync log force and not do the generation number wait
> 
> [15:58] <dchinner> if we fail to allocate again after the sync log force
> and the generation number hasn't changed, then return -EAGAIN because no
> progress has been made.
> 
> [15:59] <dchinner> Then the transaction is rolled, the transaction busy
> list is cleared, and if the next allocation attempt fails becaues
> everything is busy, we go to sleep waiting for the generation to change
> 
> [16:00] <dchinner> but because the transaction does not hold any busy
> extents, it cannot deadlock here because it does not pin any extents
> that are in the busy tree....
> 
> [16:05] <dchinner> All the generation number is doing here is telling us
> whether there was busy extent resolution between the time we last
> skipped a viable extent because it was busy and when the flush
> completes.
> 
> [16:06] <dchinner> it doesn't mean the next allocation will succeed,
> just that progress has been made so trying the allocation attempt will
> at least get a different result to the previous scan.
> 
> I think the callsites go from this:
> 
> 	if (busy) {
> 		xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
> 		trace_xfs_alloc_size_busy(args);
> 		xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
> 		goto restart;
> 	}

I was thinking this code changes to:

	flags |= XFS_ALLOC_FLAG_TRY_FLUSH;
	....
	<attempt allocation>
	....
	if (busy) {
		xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
		trace_xfs_alloc_size_busy(args);
		error = xfs_extent_busy_flush(args->tp, args->pag,
				busy_gen, flags);
		if (!error) {
			flags &= ~XFS_ALLOC_FLAG_TRY_FLUSH;
			goto restart;
		}
		/* jump to cleanup exit point */
		goto out_error;
	}

Note the different first parameter - we pass args->tp, not args->mp
so that the flush has access to the transaction context...

> to something like this:
> 
> 	bool	try_log_flush = true;
> 	...
> restart:
> 	...
> 
> 	if (busy) {
> 		bool	progress;
> 
> 		xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
> 		trace_xfs_alloc_size_busy(args);
> 
> 		/*
> 		 * If the current transaction has an extent on the busy
> 		 * list, we're allocating space as part of freeing
> 		 * space, and all the free space is busy, we can't hang
> 		 * here forever.  Force the log to try to unbusy free
> 		 * space that could have been freed by other
> 		 * transactions, and retry the allocation.  If the
> 		 * allocation fails a second time because all the free
> 		 * space is busy and nobody made any progress with
> 		 * clearing busy extents, return EAGAIN so the caller
> 		 * can roll this transaction.
> 		 */
> 		if ((flags & XFS_ALLOC_FLAG_FREEING) &&
> 		    !list_empty(&tp->t_busy_list)) {
> 			int log_flushed;
> 
> 			if (try_log_flush) {
> 				_xfs_log_force(mp, XFS_LOG_SYNC, &log_flushed);
> 				try_log_flush = false;
> 				goto restart;
> 			}
> 
> 			if (busy_gen == READ_ONCE(pag->pagb_gen))
> 				return -EAGAIN;
> 
> 			/* XXX should we set try_log_flush = true? */
> 			goto restart;
> 		}
> 
> 		xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
> 		goto restart;
> 	}
> 
> IOWs, I think Dave wants us to keep the changes in the allocator instead
> of spreading it around.

Sort of - I want the busy extent flush code to be isolated inside
xfs_extent_busy_flush(), not spread around the allocator. :)

xfs_extent_busy_flush(
	struct xfs_trans	*tp,
	struct xfs_perag	*pag,
	unsigned int		busy_gen,
	unsigned int		flags)
{
	error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
	if (error)
		return error;

	/*
	 * If we are holding busy extents, the caller may not want
	 * to block straight away. If we are being told just to try
	 * a flush or progress has been made since we last skipped a busy
	 * extent, return immediately to allow the caller to try
	 * again. If we are freeing extents, we might actually be
	 * holding the onyly free extents in the transaction busy
	 * list and the log force won't resolve that situation. In
	 * this case, return -EAGAIN in that case to tell the caller
	 * it needs to commit the busy extents it holds before
	 * retrying the extent free operation.
	 */
	if (!list_empty(&tp->t_busy_list)) {
		if (flags & XFS_ALLOC_FLAG_TRY_FLUSH)
			return 0;
		if (busy_gen != READ_ONCE(pag->pagb_gen))
			return 0;
		if (flags & XFS_ALLOC_FLAG_FREEING)
			return -EAGAIN;
	}

	/* wait for progressing resolving busy extents */
	do {
		prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
		if  (busy_gen != READ_ONCE(pag->pagb_gen))
			break;
		schedule();
	} while (1);

	finish_wait(&pag->pagb_wait, &wait);
	return 0;
}

It seems cleaner to me to put this all in xfs_extent_busy_flush()
rather than having open-coded handling of extent free constraints in
each potential flush location. We already have retry semantics
around the flush, let's just extend them slightly....

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx