On Thu, Feb 18, 2021 at 11:34:51AM +1100, Dave Chinner wrote:
> On Wed, Feb 17, 2021 at 08:23:39AM -0500, Brian Foster wrote:
> > The blocks used for allocation btrees (bnobt and countbt) are
> > technically considered free space. This is because as free space is
> > used, allocbt blocks are removed and naturally become available for
> > traditional allocation. However, this means that a significant
> > portion of free space may consist of in-use btree blocks if free
> > space is severely fragmented.
> >
> > On large filesystems with large perag reservations, this can lead to
> > a rare but nasty condition where a significant amount of physical
> > free space is available, but the majority of actual usable blocks
> > consist of in-use allocbt blocks. We have a record of a (~12TB, 32
> > AG) filesystem with multiple AGs in a state with ~2.5GB or so free
> > blocks tracked across ~300 total allocbt blocks, but effectively at
> > 100% full because the free space is entirely consumed by
> > refcountbt perag reservation.
> >
> > Such a large perag reservation is by design on large filesystems.
> > The problem is that because the free space is so fragmented, this AG
> > contributes the 300 or so allocbt blocks to the global counters as
> > free space. If this pattern repeats across enough AGs, the
> > filesystem lands in a state where global block reservation can
> > outrun physical block availability. For example, a streaming
> > buffered write on the affected filesystem continues to allow delayed
> > allocation beyond the point where writeback starts to fail due to
> > physical block allocation failures. The expected behavior is for the
> > delalloc block reservation to fail gracefully with -ENOSPC before
> > physical block allocation failure is a possibility.
>
> *nod*
>
> > To address this problem, introduce a percpu counter to track the sum
> > of the allocbt block counters already tracked in the AGF. Use the
> > new counter to set these blocks aside at reservation time and thus
> > ensure they cannot be allocated until truly available. Since this is
> > only necessary when large reflink perag reservations are in place
> > and the counter requires a read of each AGF to fully populate, only
> > enforce on reflink enabled filesystems. This allows initialization
> > of the counter at ->pagf_init time because the refcountbt perag
> > reservation init code reads each AGF at mount time.
>
> Ok, so the mechanism sounds ok, but a per-cpu counter seems like
> premature optimisation. How often are we really updating btree block
> counts? An atomic counter is good for at least a million updates a
> second across a 2 socket 32p machine, and I highly doubt we're
> incrementing/decrementing btree block counts that often on such a
> machine.
>
> While per-cpu counters have a fast write side, they come with
> additional algorithmic complexity. Hence if the update rate of the
> counter is not fast enough to need per-cpu counters, we should avoid
> them. Just because other free space counters use per-cpu counters,
> it doesn't mean everything in that path needs to use them...
>

The use of the percpu counter was more for the read side than the write
side. I think of it more as an abstraction to avoid having to open code
and define a new spin lock just for this. I actually waffled a bit on
just setting a batch count of 0 to get roughly equivalent behavior, but
didn't think it would make much difference.
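FWIW, roughly what I had in mind, purely as a sketch (the m_allocbt_blks
field name and the batch value here are made up for illustration, not
necessarily what the patch ends up with):

	#include <linux/percpu_counter.h>

	/* hypothetical per-mount count of in-use allocbt blocks */
	struct percpu_counter	m_allocbt_blks;

	/* mount time, as each AGF is read for the perag reservation */
	error = percpu_counter_init(&mp->m_allocbt_blks, 0, GFP_KERNEL);

	/*
	 * allocbt block allocated or freed: a small batch (or even 0)
	 * keeps the central count close enough to accurate that the
	 * reservation path never has to lock and sum the per-cpu deltas.
	 */
	percpu_counter_add_batch(&mp->m_allocbt_blks, delta, 4);

With a batch of 0, every update folds straight into the central count
under the counter's internal spinlock, which is the "roughly equivalent
behavior" I was referring to above.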
> > Note that the counter uses a small percpu batch size to allow the
> > allocation paths to keep the primary count accurate enough that the
> > reservation path doesn't ever need to lock and sum the counter.
> > Absolute accuracy is not required here, just that the counter
> > reflects the majority of unavailable blocks so the reservation path
> > fails first.
>
> And this makes the per-cpu counter scale almost no better than a
> simple atomic counter, because a spinlock requires two atomic
> operations (lock and unlock). Hence a batch count of 4 only reduces
> the atomic op count by half but introduces a lot of extra
> complexity. It won't make a difference to the scalability of
> workloads that hammer the btree block count because contention on
> the internal counter spinlock will occur at close to the same
> concurrency rate as would occur on an atomic counter.
>

Right, but percpu_counter_read_positive() allows a fast read in the
xfs_mod_fdblocks() path. I didn't use an atomic because I was concerned
about introducing overhead in that path. If we're Ok with whatever
overhead an atomic read might introduce (a spin lock in the worst case
for some arches), then I don't mind switching over to that. I also
don't mind defining a new spin lock and explicitly implementing the
lockless read in xfs_mod_fdblocks(); I just thought it was extra code
for little benefit over the percpu counter. Preference?

Brian

> Hence a per-cpu counter used in this manner seems like a premature
> optimisation to me...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
>
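P.S. To make the read side trade-off above concrete, a rough sketch of
the two options in the reservation path (the m_allocbt_blks name and the
simplified check are illustrative only, not the actual patch code):

	/* (a) percpu counter: lockless, slightly stale read */
	struct percpu_counter	m_allocbt_blks;

	set_aside = mp->m_alloc_set_aside +
		    percpu_counter_read_positive(&mp->m_allocbt_blks);

	/*
	 * (b) atomic64_t alternative: atomic64_read() is a plain load on
	 * 64-bit arches, but can fall back to a spinlock-protected read
	 * on 32-bit arches without native 64-bit atomics.
	 */
	atomic64_t		m_allocbt_blks;

	set_aside = mp->m_alloc_set_aside +
		    atomic64_read(&mp->m_allocbt_blks);

	/* either way, delalloc reservation fails with -ENOSPC before the
	 * in-use allocbt blocks can be handed out */
	if (fdblocks - delta < set_aside)
		return -ENOSPC;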