Re: [PATCH] xfs: fix AGFL allocation dead lock

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Apr 14, 2023 at 07:38:57 AM +1000, Dave Chinner wrote:
> On Thu, Apr 13, 2023 at 03:46:38PM +0530, Chandan Babu R wrote:
>> On Thu, Apr 13, 2023 at 09:59:15 AM +1000, Dave Chinner wrote:
>> > On Wed, Apr 12, 2023 at 01:53:59PM +0530, Chandan Babu R wrote:
>> >> Processing each of the "struct xfs_refcount_intent" can cause two
>> >> refcount
>> >> btree blocks to be freed:
>> >> - A high level transacation will invoke
>> >> xfs_refcountbt_free_block() twice.
>> >> - The first invocation adds an extent entry to the transaction's
>> >> busy extent
>> >>   list. The second invocation can find the previously freed busy
>> >> extent and
>> >>   hence wait indefinitely for the busy extent to be flushed.
>> >> 
>> >> Also, processing a single "struct xfs_refcount_intent" can
>> >> require the leaf
>> >> block and its immediate parent block to be freed.  The leaf block
>> >> is added to
>> >> the transaction's busy list. Freeing the parent block can result
>> >> in the task
>> >> waiting for the busy extent (present in the high level transaction) to be
>> >> flushed.
>> >
>> > Yes, it probably can, but this is a different problem - this is an
>> > internal btree update issue, not a "free multiple user extents in a
>> > single transaction that may only have a reservation large enough
>> > for a single user extent modification".
>> >
>> > So, lets think about why the allocator might try to reuse a busy
>> > extent on the next extent internal btree free operation in the
>> > transaction.  The only way that I can see that happening is if the
>> > AGFL needs refilling, and the only way the allocator should get
>> > stuck in this way is if there are no other free extents in the AG.
>> 
>> If the first extent that was freed by the transaction (and hence also marked
>> busy) happens to be the first among several other non-busy free
>> extents found
>> during AGFL allocation, the task will get blocked waiting for the
>> busy extent
>> flush to complete. However, this can be solved if
>> xfs_alloc_ag_vextent_size()
>> is modified to traverse the rest of the free space btree to find other
>> non-busy extents. Busy extents can be flushed only as a last resort when
>> non-busy extents cannot be found.
>
> Yes, exactly my point: Don't block on busy extents if there are
> other free space candidates available to can be allocated without
> blocking.
>
>> 
>> >
>> > It then follows that if there are no other free extents in the AG,
>> > then we don't need to refill the AGFL, because freeing an extent in
>> > an empty AG will never cause the free space btrees to grow. In which
>> > case, we should not ever need to allocate from an extent that was
>> > previously freed in this specific transaction.
>> >
>> > We should also have XFS_ALLOC_FLAG_FREEING set, and this allows the
>> > AGFL refill to abort without error if there are no free blocks
>> > available because it's not necessary in this case.  If we can't find
>> > a non-busy extent after flushing on an AGFL fill for a
>> > XFS_ALLOC_FLAG_FREEING operation, we should just abort the freelist
>> > refill and allow the extent free operation to continue.
>> 
>> I tried in vain to figure out a correct way to perform non-blocking busy
>> extent flush. If it involves using a timeout mechanism, then I don't know as
>> to what constitues a good timeout value. Please let me know if you have any
>> suggestions to this end.
>
> Why would a non-blocking busy extent flush sleep? By definition,
> "non blocking" means "does not context switch away from the current
> task".
>
> IOWs, a non-blocking busy extent flush is effectively just log force
> operation. We may need to sample the generation number to determine
> if progress has been made next time we get back to the "all extents
> are busy so flush the busy extents" logic again, but otherwise I
> think all we need do here is elide the "wait for generation number
> to change" logic.
>
> Then, if we've retried the allocation after a log force, and we
> still haven't made progress and we are doing an AGFL allocation in
> XFS_ALLOC_FLAG_FREEING context, then we can simply abort the
> allocation attempt at this point. i.e. if we don't have extents we
> can allocate immediately, don't bother waiting for busy extents to
> be resolved...

The corresponding pseudocode for allocating blocks to AGFL during an extent
free operation would be the following,
1. Search all extents in the CNTBT for non-busy extents.
2. Return if extent is found; Otherwise,
3. Issue a log force.
4. Retry non-busy free extent allocation.
5. Return back regardless of whether non-busy extents were found or not.

I think I may have found one scenario where the above pseudocode would break.

Assume that an AG has very few blocks of free space left and that these blocks
fit within the root node of CNTBT.

Assume that we have a large file which is backed up by contiguous extents
belonging to the AG and that these extents have a refcount of 1.

fpunch operation is issued on alternating blocks of this large file and the
punched out extents are added to,
1. Per-ag RB tree that tracks busy extents
2. CNTBT causing the CNTBT to grow.

Step 2 can cause consumption of blocks in AGFL and we can end up with a CNTBT
having two fully filled leaf blocks and one root node.

At this instance, a fpunch operation on a file extent having a refcount of 2
can cause the leaf node of the refcount btree to be freed. The free operation
now tries to fill the AGFL. However, it only finds busy extents and issues a
log force. The operation can fail to find any non-busy free extents because
the busy extents probably have not yet been discarded.

The free extent operation fails to fill up the AGFL. Later, adding the free
extent (refcount btree's leaf node) to the CNTBT can require the CNTBT to be
expanded (E.g. Adding a new leaf node). However, this fails since AGFL doesn't
have any free blocks left.

-- 
chandan



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux