Re: [PATCH] xfs: set aside allocation btree blocks from block reservation

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 17 Feb 2021 at 18:53, Brian Foster wrote:
> The blocks used for allocation btrees (bnobt and countbt) are
> technically considered free space. This is because as free space is
> used, allocbt blocks are removed and naturally become available for
> traditional allocation. However, this means that a significant
> portion of free space may consist of in-use btree blocks if free
> space is severely fragmented.
>
> On large filesystems with large perag reservations, this can lead to
> a rare but nasty condition where a significant amount of physical
> free space is available, but the majority of actual usable blocks
> consist of in-use allocbt blocks. We have a record of a (~12TB, 32
> AG) filesystem with multiple AGs in a state with ~2.5GB or so free
> blocks tracked across ~300 total allocbt blocks, but effectively at
> 100% full because the the free space is entirely consumed by
> refcountbt perag reservation.
>
> Such a large perag reservation is by design on large filesystems.
> The problem is that because the free space is so fragmented, this AG
> contributes the 300 or so allocbt blocks to the global counters as
> free space. If this pattern repeats across enough AGs, the
> filesystem lands in a state where global block reservation can
> outrun physical block availability. For example, a streaming
> buffered write on the affected filesystem continues to allow delayed
> allocation beyond the point where writeback starts to fail due to
> physical block allocation failures. The expected behavior is for the
> delalloc block reservation to fail gracefully with -ENOSPC before
> physical block allocation failure is a possibility.
>
> To address this problem, introduce a percpu counter to track the sum
> of the allocbt block counters already tracked in the AGF. Use the
> new counter to set these blocks aside at reservation time and thus
> ensure they cannot be allocated until truly available. Since this is
> only necessary when large reflink perag reservations are in place
> and the counter requires a read of each AGF to fully populate, only
> enforce on reflink enabled filesystems. This allows initialization
> of the counter at ->pagf_init time because the refcountbt perag
> reservation init code reads each AGF at mount time.
>
> Note that the counter uses a small percpu batch size to allow the
> allocation paths to keep the primary count accurate enough that the
> reservation path doesn't ever need to lock and sum the counter.
> Absolute accuracy is not required here, just that the counter
> reflects the majority of unavailable blocks so the reservation path
> fails first.
>

The changes look good to me from the perspective of logical correctness.

Reviewed-by: Chandan Babu R <chandanrlinux@xxxxxxxxx>

-- 
chandan



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux