On 17 Feb 2021 at 18:53, Brian Foster wrote:
> The blocks used for allocation btrees (bnobt and countbt) are
> technically considered free space. This is because as free space is
> used, allocbt blocks are removed and naturally become available for
> traditional allocation. However, this means that a significant
> portion of free space may consist of in-use btree blocks if free
> space is severely fragmented.
>
> On large filesystems with large perag reservations, this can lead to
> a rare but nasty condition where a significant amount of physical
> free space is available, but the majority of actual usable blocks
> consist of in-use allocbt blocks. We have a record of a (~12TB, 32
> AG) filesystem with multiple AGs in a state with ~2.5GB or so free
> blocks tracked across ~300 total allocbt blocks, but effectively at
> 100% full because the free space is entirely consumed by
> refcountbt perag reservation.
>
> Such a large perag reservation is by design on large filesystems.
> The problem is that because the free space is so fragmented, this AG
> contributes the 300 or so allocbt blocks to the global counters as
> free space. If this pattern repeats across enough AGs, the
> filesystem lands in a state where global block reservation can
> outrun physical block availability. For example, a streaming
> buffered write on the affected filesystem continues to allow delayed
> allocation beyond the point where writeback starts to fail due to
> physical block allocation failures. The expected behavior is for the
> delalloc block reservation to fail gracefully with -ENOSPC before
> physical block allocation failure is a possibility.
>
> To address this problem, introduce a percpu counter to track the sum
> of the allocbt block counters already tracked in the AGF. Use the
> new counter to set these blocks aside at reservation time and thus
> ensure they cannot be allocated until truly available.
> Since this is only necessary when large reflink perag reservations
> are in place and the counter requires a read of each AGF to fully
> populate, only enforce on reflink enabled filesystems. This allows
> initialization of the counter at ->pagf_init time because the
> refcountbt perag reservation init code reads each AGF at mount time.
>
> Note that the counter uses a small percpu batch size to allow the
> allocation paths to keep the primary count accurate enough that the
> reservation path doesn't ever need to lock and sum the counter.
> Absolute accuracy is not required here, just that the counter
> reflects the majority of unavailable blocks so the reservation path
> fails first.
>

The changes look good to me from the perspective of logical correctness.

Reviewed-by: Chandan Babu R <chandanrlinux@xxxxxxxxx>

--
chandan
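
For readers following the thread, the set-aside described in the quoted
commit message can be sketched roughly as below. This is a simplified
userspace model, not the patch itself: the struct, field, and function
names are hypothetical, and a plain integer stands in for the percpu
counter, so batching and locking are not modelled.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical userspace model; these names are not the XFS ones. */
struct fs_counters {
	int64_t free_blocks;    /* global free count, includes allocbt blocks */
	int64_t allocbt_blocks; /* in-use allocbt blocks, summed over all AGFs */
	int64_t set_aside;      /* fixed number of blocks always held back */
};

/* Blocks a reservation may actually draw on, never negative. */
static int64_t fs_available(const struct fs_counters *c)
{
	int64_t avail = c->free_blocks - c->set_aside - c->allocbt_blocks;

	return avail > 0 ? avail : 0;
}

/* Reserve nblocks, failing with -ENOSPC instead of overcommitting. */
static int fs_reserve(struct fs_counters *c, int64_t nblocks)
{
	if (nblocks > fs_available(c))
		return -ENOSPC;
	c->free_blocks -= nblocks;
	return 0;
}
```

The point of the model is only that in-use allocbt blocks are subtracted
before the reservation check, so the reservation fails first, before a
later physical allocation could.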