Re: [PATCH] xfs: avoid freeing multiple extents from same AG in pending transactions

Wengang Wang <wen.gang.wang@xxxxxxxxxx> · Thu, 18 May 2023 01:25:10 +0000

> On May 17, 2023, at 6:01 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> On Thu, May 18, 2023 at 12:10:53AM +0000, Wengang Wang wrote:
>> 
>> 
>>> On May 17, 2023, at 3:49 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>> 
>>> On Wed, May 17, 2023 at 07:14:32PM +0000, Wengang Wang wrote:
>>>>> On May 16, 2023, at 6:40 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>>> On Tue, May 16, 2023 at 05:59:13PM -0700, Darrick J. Wong wrote:
>>>>> I was thinking this code changes to:
>>>>> 
>>>>> flags |= XFS_ALLOC_FLAG_TRY_FLUSH;
>>>>> ....
>>>>> <attempt allocation>
>>>>> ....
>>>>> if (busy) {
>>>>> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>>>>> trace_xfs_alloc_size_busy(args);
>>>>> error = xfs_extent_busy_flush(args->tp, args->pag,
>>>>> busy_gen, flags);
>>>>> if (!error) {
>>>>> flags &= ~XFS_ALLOC_FLAG_TRY_FLUSH;
>>>> 
>>>> What’s the benefits to use XFS_ALLOC_FLAG_TRY_FLUSH?
>>>> If no change happened to pagb_gen, we would get nothing good in the retry
>>>> but waste cycles. Or I missed something?
>>> 
>>> You missed something: the synchronous log force is always done.
>>> 
>> 
>> It’s true that synchronous log force is done.
>> 
>>> The log force is what allows busy extents to be resolved - busy
>>> extents have to be committed to stable storage before they can be
>>> removed from the busy extent tree.
>> 
>> Yep.
>> 
>>> 
>>> If online discards are not enabled, busy extents are resolved
>>> directly in journal IO completion - the log force waits for this to
>>> occur. In this case, pag->pagb_gen will have already incremented to
>>> indicate progress has been made, and we should never wait in the
>>> loop after the log force. The only time we do that is when the
>>> current transaction holds busy extents itself, and hence if the
>>> current tx holds busy extents we should not wait beyond the log
>>> force....
>> 
>> So you are talking about the case of “pagb_gen will have already incremented”,
>> In this case your next two lines:
>> 
>>  if (busy_gen != READ_ONCE(pag->pagb_gen))
>> return 0;
>> 
>> would capture that and return immediately without waiting. So TRY_FLUSH is not
>> helpful in this case. 
> 
> Except when pag->pagb_gen doesn't change. If it hasn't changed, then
> we'd immediately return -EAGAIN without trying again. That is not
> what the behaviour we want; we want the allocation to always try
> again at least once after a flush has been run, because ....
> 
>>> If online discards are enabled, then they'll be scheduled by journal
>>> IO completion. i.e. waiting on the log force guarntees pending
>>> discards have been scheduled and they'll start completing soon after
>>> the log force returns. When they complete they'll start incrementing
>>> pag->pagb_gen. This is the case the pag->pagb_gen wait loop exists
>>> for - it waits for a discard to complete and resolve the busy extent
>>> in it's IO compeltion routine. At which point the allocation attempt
>>> can restart.
>> 
>> In above case, you are talking about the case that pag->pagb_gen will be
>> incremented soon without any blocking.
>> In this case, in the allocator path, still the two lines:
>>  if (busy_gen != READ_ONCE(pag->pagb_gen))
>> return 0;
>> 
>> The condition of "busy_gen != READ_ONCE(pag->pagb_gen)” can be true
>> or a false according to the race of this process VS the queue workers (which
>> start completing).  In case it’s true, the code return immediately. Otherwise
>> the code runs into the loop waiting above condition become true. When
>> the queue workers incremented pag->pagb_gen, the allocator path jumps
>> out the loop and go — that’s perfect. I don't see why TRY_FLUSH is needed.
> 
> .... we sample pag->pagb_gen part way through the extent search and
> we might sample it multiple times if we encounter multiple busy
> extents.  The way the cursor currently works is that it stores the
> last busy gen returned. The pag->pagb_gen value might change between
> the first busy extent we encounter and the last that we encounter in
> the search. If this happens, it implies that some busy extents have
> resolved while we have been searching.
> 
> In that case, we can enter the flush with busy_gen = pag->pagb_gen,
> but that doesn't mean no busy extents have been resolved since we
> started the allocation search. Hence the flush itself may not
> resolve any new busy extents, but we still may have other busy
> extents that were resolved while we were scanning the tree.
> 

Yep, there could be such situations though don’t know frequently that happens.

> Yes, we could change the way we track the busy gen in the cursor,
> but that likely has other subtle impacts. e.g. we now have to
> consider the fact we may never get a wakeup because all busy extents
> in flight have already been resolved before we go to sleep.

Hm.. yep, this could be a future enhancement. 

> 
> It is simpler to always retry the allocation after a minimal flush
> to capture busy extents that were resolved while we were scanning
> the tree. This is a slow path - we don't care if we burn extra CPU
> on a retry that makes no progress because this condition is only
> likely to occur when we are trying to do allocation or freeing
> extents very near ENOSPC.

Agreed.

> 
> The try-flush gives us a mechanism that always does a minimum flush
> first before a retry. If that retry fails, then we block, fail or
> retry based on whether progress has been made. But we always want
> that initial retry before we block, fail or retry...
> 

OK.

>> So change the line of
>> 
>> flags |= XFS_ALLOC_FLAG_TRY_FLUSH;
>> 
>> to
>> 
>> if (xfs_has_discard(mp))
>>    flags |= XFS_ALLOC_FLAG_TRY_FLUSH;
>> 
>> Sounds good?
> 
> No. The allocator itself should know nothing about when discards are
> enabled - the whole point of the busy extent infrastructure handling
> discards is to abstract discard delays away from the allocator
> itself. The allocator only cares if the extent is busy or not, it
> doesn't care why the extent is busy. And, as per above, the
> TRY_FLUSH triggered retry is needed regardless of discards because
> busy extents can resolve while a search is in progress instead of
> being resolved by the log force...

OK, will be with TRY_FLUSH.

thanks,
wengang