Re: [PATCH] xfs: avoid freeing multiple extents from same AG in pending transactions

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Tue, 16 May 2023 18:54:33 -0700

On Wed, May 17, 2023 at 11:40:05AM +1000, Dave Chinner wrote:
> On Tue, May 16, 2023 at 05:59:13PM -0700, Darrick J. Wong wrote:
> > Since 6.3 we got rid of the _THIS_AG indirection stuff and that becomes:
> > 
> > xfs_alloc_fix_freelist ->
> > xfs_alloc_ag_vextent_size ->
> > (run all the way to the end of the bnobt) ->
> > xfs_extent_busy_flush ->
> > <stall on the busy extent that's in @tp->busy_list>
> > 
> > xfs_extent_busy_flush does this, potentially while we're holding the
> > freed extent in @tp->t_busy_list:
> > 
> > 	error = xfs_log_force(mp, XFS_LOG_SYNC);
> > 	if (error)
> > 		return;
> > 
> > 	do {
> > 		prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
> > 		if  (busy_gen != READ_ONCE(pag->pagb_gen))
> > 			break;
> > 		schedule();
> > 	} while (1);
> > 
> > 	finish_wait(&pag->pagb_wait, &wait);
> > 
> > The log force kicks the CIL to process whatever other committed items
> > might be lurking in the log.  *Hopefully* someone else freed an extent
> > in the same AG, so the log force has now caused that *other* extent to
> > get processed so it has now cleared the busy list.  Clearing something
> > from the busy list increments the busy generation (aka pagb_gen).
> 
> *nod*
> 
> > Unfortunately, there aren't any other extents, so the busy_gen does not
> > change, and the loop runs forever.
> > 
> > At this point, Dave writes:
> > 
> > [15:57] <dchinner> so if we enter that function with busy extents on the
> > transaction, and we are doing an extent free operation, we should return
> > after the sync log force and not do the generation number wait
> > 
> > [15:58] <dchinner> if we fail to allocate again after the sync log force
> > and the generation number hasn't changed, then return -EAGAIN because no
> > progress has been made.
> > 
> > [15:59] <dchinner> Then the transaction is rolled, the transaction busy
> > list is cleared, and if the next allocation attempt fails becaues
> > everything is busy, we go to sleep waiting for the generation to change
> > 
> > [16:00] <dchinner> but because the transaction does not hold any busy
> > extents, it cannot deadlock here because it does not pin any extents
> > that are in the busy tree....
> > 
> > [16:05] <dchinner> All the generation number is doing here is telling us
> > whether there was busy extent resolution between the time we last
> > skipped a viable extent because it was busy and when the flush
> > completes.
> > 
> > [16:06] <dchinner> it doesn't mean the next allocation will succeed,
> > just that progress has been made so trying the allocation attempt will
> > at least get a different result to the previous scan.
> > 
> > I think the callsites go from this:
> > 
> > 	if (busy) {
> > 		xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
> > 		trace_xfs_alloc_size_busy(args);
> > 		xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
> > 		goto restart;
> > 	}
> 
> I was thinking this code changes to:
> 
> 	flags |= XFS_ALLOC_FLAG_TRY_FLUSH;
> 	....
> 	<attempt allocation>
> 	....
> 	if (busy) {
> 		xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
> 		trace_xfs_alloc_size_busy(args);
> 		error = xfs_extent_busy_flush(args->tp, args->pag,
> 				busy_gen, flags);
> 		if (!error) {
> 			flags &= ~XFS_ALLOC_FLAG_TRY_FLUSH;
> 			goto restart;
> 		}
> 		/* jump to cleanup exit point */
> 		goto out_error;
> 	}
> 
> Note the different first parameter - we pass args->tp, not args->mp
> so that the flush has access to the transaction context...

<nod>

> > to something like this:
> > 
> > 	bool	try_log_flush = true;
> > 	...
> > restart:
> > 	...
> > 
> > 	if (busy) {
> > 		bool	progress;
> > 
> > 		xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
> > 		trace_xfs_alloc_size_busy(args);
> > 
> > 		/*
> > 		 * If the current transaction has an extent on the busy
> > 		 * list, we're allocating space as part of freeing
> > 		 * space, and all the free space is busy, we can't hang
> > 		 * here forever.  Force the log to try to unbusy free
> > 		 * space that could have been freed by other
> > 		 * transactions, and retry the allocation.  If the
> > 		 * allocation fails a second time because all the free
> > 		 * space is busy and nobody made any progress with
> > 		 * clearing busy extents, return EAGAIN so the caller
> > 		 * can roll this transaction.
> > 		 */
> > 		if ((flags & XFS_ALLOC_FLAG_FREEING) &&
> > 		    !list_empty(&tp->t_busy_list)) {
> > 			int log_flushed;
> > 
> > 			if (try_log_flush) {
> > 				_xfs_log_force(mp, XFS_LOG_SYNC, &log_flushed);
> > 				try_log_flush = false;
> > 				goto restart;
> > 			}
> > 
> > 			if (busy_gen == READ_ONCE(pag->pagb_gen))
> > 				return -EAGAIN;
> > 
> > 			/* XXX should we set try_log_flush = true? */
> > 			goto restart;
> > 		}
> > 
> > 		xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
> > 		goto restart;
> > 	}
> > 
> > IOWs, I think Dave wants us to keep the changes in the allocator instead
> > of spreading it around.
> 
> Sort of - I want the busy extent flush code to be isolated inside
> xfs_extent_busy_flush(), not spread around the allocator. :)
> 
> xfs_extent_busy_flush(
> 	struct xfs_trans	*tp,
> 	struct xfs_perag	*pag,
> 	unsigned int		busy_gen,
> 	unsigned int		flags)
> {
> 	error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
> 	if (error)
> 		return error;
> 
> 	/*
> 	 * If we are holding busy extents, the caller may not want
> 	 * to block straight away. If we are being told just to try
> 	 * a flush or progress has been made since we last skipped a busy
> 	 * extent, return immediately to allow the caller to try
> 	 * again. If we are freeing extents, we might actually be
> 	 * holding the onyly free extents in the transaction busy

                       only

> 	 * list and the log force won't resolve that situation. In
> 	 * this case, return -EAGAIN in that case to tell the caller
> 	 * it needs to commit the busy extents it holds before
> 	 * retrying the extent free operation.
> 	 */
> 	if (!list_empty(&tp->t_busy_list)) {
> 		if (flags & XFS_ALLOC_FLAG_TRY_FLUSH)
> 			return 0;
> 		if (busy_gen != READ_ONCE(pag->pagb_gen))
> 			return 0;
> 		if (flags & XFS_ALLOC_FLAG_FREEING)
> 			return -EAGAIN;
> 	}

Indeed, that's a lot cleaner.

> 
> 	/* wait for progressing resolving busy extents */
> 	do {
> 		prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
> 		if  (busy_gen != READ_ONCE(pag->pagb_gen))
> 			break;
> 		schedule();
> 	} while (1);
> 
> 	finish_wait(&pag->pagb_wait, &wait);
> 	return 0;
> }
> 
> It seems cleaner to me to put this all in xfs_extent_busy_flush()
> rather than having open-coded handling of extent free constraints in
> each potential flush location. We already have retry semantics
> around the flush, let's just extend them slightly....

<nod> Wengang, how does this sound?

--D

> -Dave.
> 
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx