Re: [PATCH] xfs: avoid freeing multiple extents from same AG in pending transactions

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 18 May 2023 11:01:03 +1000

On Thu, May 18, 2023 at 12:10:53AM +0000, Wengang Wang wrote:
> 
> 
> > On May 17, 2023, at 3:49 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > 
> > On Wed, May 17, 2023 at 07:14:32PM +0000, Wengang Wang wrote:
> >>> On May 16, 2023, at 6:40 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >>> On Tue, May 16, 2023 at 05:59:13PM -0700, Darrick J. Wong wrote:
> >>> I was thinking this code changes to:
> >>> 
> >>> flags |= XFS_ALLOC_FLAG_TRY_FLUSH;
> >>> ....
> >>> <attempt allocation>
> >>> ....
> >>> if (busy) {
> >>> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
> >>> trace_xfs_alloc_size_busy(args);
> >>> error = xfs_extent_busy_flush(args->tp, args->pag,
> >>> busy_gen, flags);
> >>> if (!error) {
> >>> flags &= ~XFS_ALLOC_FLAG_TRY_FLUSH;
> >> 
> >> What’s the benefits to use XFS_ALLOC_FLAG_TRY_FLUSH?
> >> If no change happened to pagb_gen, we would get nothing good in the retry
> >> but waste cycles. Or I missed something?
> > 
> > You missed something: the synchronous log force is always done.
> > 
> 
> It’s true that synchronous log force is done.
> 
> > The log force is what allows busy extents to be resolved - busy
> > extents have to be committed to stable storage before they can be
> > removed from the busy extent tree.
> 
> Yep.
> 
> > 
> > If online discards are not enabled, busy extents are resolved
> > directly in journal IO completion - the log force waits for this to
> > occur. In this case, pag->pagb_gen will have already incremented to
> > indicate progress has been made, and we should never wait in the
> > loop after the log force. The only time we do that is when the
> > current transaction holds busy extents itself, and hence if the
> > current tx holds busy extents we should not wait beyond the log
> > force....
> 
> So you are talking about the case of “pagb_gen will have already incremented”,
> In this case your next two lines:
> 
>   if (busy_gen != READ_ONCE(pag->pagb_gen))
> return 0;
> 
> would capture that and return immediately without waiting. So TRY_FLUSH is not
> helpful in this case. 

Except when pag->pagb_gen doesn't change. If it hasn't changed, then
we'd immediately return -EAGAIN without trying again. That is not
what the behaviour we want; we want the allocation to always try
again at least once after a flush has been run, because ....

> > If online discards are enabled, then they'll be scheduled by journal
> > IO completion. i.e. waiting on the log force guarntees pending
> > discards have been scheduled and they'll start completing soon after
> > the log force returns. When they complete they'll start incrementing
> > pag->pagb_gen. This is the case the pag->pagb_gen wait loop exists
> > for - it waits for a discard to complete and resolve the busy extent
> > in it's IO compeltion routine. At which point the allocation attempt
> > can restart.
> 
> In above case, you are talking about the case that pag->pagb_gen will be
> incremented soon without any blocking.
> In this case, in the allocator path, still the two lines:
>   if (busy_gen != READ_ONCE(pag->pagb_gen))
> return 0;
>
> The condition of "busy_gen != READ_ONCE(pag->pagb_gen)” can be true
> or a false according to the race of this process VS the queue workers (which
> start completing).  In case it’s true, the code return immediately. Otherwise
> the code runs into the loop waiting above condition become true. When
> the queue workers incremented pag->pagb_gen, the allocator path jumps
> out the loop and go — that’s perfect. I don't see why TRY_FLUSH is needed.

.... we sample pag->pagb_gen part way through the extent search and
we might sample it multiple times if we encounter multiple busy
extents.  The way the cursor currently works is that it stores the
last busy gen returned. The pag->pagb_gen value might change between
the first busy extent we encounter and the last that we encounter in
the search. If this happens, it implies that some busy extents have
resolved while we have been searching.

In that case, we can enter the flush with busy_gen = pag->pagb_gen,
but that doesn't mean no busy extents have been resolved since we
started the allocation search. Hence the flush itself may not
resolve any new busy extents, but we still may have other busy
extents that were resolved while we were scanning the tree.

Yes, we could change the way we track the busy gen in the cursor,
but that likely has other subtle impacts. e.g. we now have to
consider the fact we may never get a wakeup because all busy extents
in flight have already been resolved before we go to sleep.

It is simpler to always retry the allocation after a minimal flush
to capture busy extents that were resolved while we were scanning
the tree. This is a slow path - we don't care if we burn extra CPU
on a retry that makes no progress because this condition is only
likely to occur when we are trying to do allocation or freeing
extents very near ENOSPC.

The try-flush gives us a mechanism that always does a minimum flush
first before a retry. If that retry fails, then we block, fail or
retry based on whether progress has been made. But we always want
that initial retry before we block, fail or retry...

> So change the line of
>  
> flags |= XFS_ALLOC_FLAG_TRY_FLUSH;
> 
> to
> 
> if (xfs_has_discard(mp))
>     flags |= XFS_ALLOC_FLAG_TRY_FLUSH;
> 
> Sounds good?

No. The allocator itself should know nothing about when discards are
enabled - the whole point of the busy extent infrastructure handling
discards is to abstract discard delays away from the allocator
itself. The allocator only cares if the extent is busy or not, it
doesn't care why the extent is busy. And, as per above, the
TRY_FLUSH triggered retry is needed regardless of discards because
busy extents can resolve while a search is in progress instead of
being resolved by the log force...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx