Re: [PATCH] xfs: avoid freeing multiple extents from same AG in pending transactions

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




> On May 17, 2023, at 3:49 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> On Wed, May 17, 2023 at 07:14:32PM +0000, Wengang Wang wrote:
>>> On May 16, 2023, at 6:40 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>> On Tue, May 16, 2023 at 05:59:13PM -0700, Darrick J. Wong wrote:
>>> I was thinking this code changes to:
>>> 
>>> flags |= XFS_ALLOC_FLAG_TRY_FLUSH;
>>> ....
>>> <attempt allocation>
>>> ....
>>> if (busy) {
>>> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>>> trace_xfs_alloc_size_busy(args);
>>> error = xfs_extent_busy_flush(args->tp, args->pag,
>>> busy_gen, flags);
>>> if (!error) {
>>> flags &= ~XFS_ALLOC_FLAG_TRY_FLUSH;
>> 
>> What’s the benefits to use XFS_ALLOC_FLAG_TRY_FLUSH?
>> If no change happened to pagb_gen, we would get nothing good in the retry
>> but waste cycles. Or I missed something?
> 
> You missed something: the synchronous log force is always done.
> 

It’s true that synchronous log force is done.

> The log force is what allows busy extents to be resolved - busy
> extents have to be committed to stable storage before they can be
> removed from the busy extent tree.

Yep.

> 
> If online discards are not enabled, busy extents are resolved
> directly in journal IO completion - the log force waits for this to
> occur. In this case, pag->pagb_gen will have already incremented to
> indicate progress has been made, and we should never wait in the
> loop after the log force. The only time we do that is when the
> current transaction holds busy extents itself, and hence if the
> current tx holds busy extents we should not wait beyond the log
> force....

So you are talking about the case of “pagb_gen will have already incremented”,
In this case your next two lines:

  if (busy_gen != READ_ONCE(pag->pagb_gen))
return 0;

would capture that and return immediately without waiting. So TRY_FLUSH is not
helpful in this case. 

> 
> If online discards are enabled, then they'll be scheduled by journal
> IO completion. i.e. waiting on the log force guarntees pending
> discards have been scheduled and they'll start completing soon after
> the log force returns. When they complete they'll start incrementing
> pag->pagb_gen. This is the case the pag->pagb_gen wait loop exists
> for - it waits for a discard to complete and resolve the busy extent
> in it's IO compeltion routine. At which point the allocation attempt
> can restart.

In above case, you are talking about the case that pag->pagb_gen will be
incremented soon without any blocking.
In this case, in the allocator path, still the two lines:
  if (busy_gen != READ_ONCE(pag->pagb_gen))
return 0;

The condition of "busy_gen != READ_ONCE(pag->pagb_gen)” can be true
or a false according to the race of this process VS the queue workers (which
start completing).  In case it’s true, the code return immediately. Otherwise
the code runs into the loop waiting above condition become true. When
the queue workers incremented pag->pagb_gen, the allocator path jumps
out the loop and go — that’s perfect. I don't see why TRY_FLUSH is needed.


> 
> However, the same caveat about the current tx holding
> busy extents still exists - we can't tell the difference between
> "discards scheduled but not completed" and "no busy extents to
> resolve" in the flush code. Hence regardless of the online discard
> feature state, we should not be waiting on busy extent generation
> changes if we hold busy extents in the transaction....

Yes, that’s true.
But first, without TRY_FLUSH, the code will return -EAGAIN immediately 
with the following two lines:

  if (flags & XFS_ALLOC_FLAG_FREEING)
return -EAGAIN;
without any waiting. So no problem if we don’t use TRY_FLUSH. 


> 
> IOWs, the TRY_FLUSH code reflects the fact that for most users, the
> log force resolves the busy extents, not the wait loop on
> pag->pagb_gen changing. The wait loop only really kicks in when
> online discard is active, and in that case we really do want to
> retry allocation without waiting for (potentially very slow)
> discards to complete first. We'll do that "wait for discards" the
> second time we fail to find a non-busy extent....
> 

It seems that you want give another chance to go over the free extents in AG
to see if we will have luck in the retry without any guarantee. It’s possible
that we are lucky that in the retry, the relevant busy extents are resolved
(cleared), but it’s also possible that we don’t have the luck.

Also the TRY_FLUSH doesn’t look to be a key point to fix the original problem
(deadlock) but an optimization? And that may help only online discards are
enabled (I don’t think that’s a popular mount option, it kills performance greatly).
If you want that anyway, I’d like make it there only when "online discards are
enabled”, otherwise it helps nothing at all but waste cycles.
So change the line of
 
flags |= XFS_ALLOC_FLAG_TRY_FLUSH;

to

if (xfs_has_discard(mp))
    flags |= XFS_ALLOC_FLAG_TRY_FLUSH;

Sounds good?

thanks,
wengang





[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux