On Thu, Feb 18, 2021 at 11:34:51AM +1100, Dave Chinner wrote:
> On Wed, Feb 17, 2021 at 08:23:39AM -0500, Brian Foster wrote:
> > The blocks used for allocation btrees (bnobt and countbt) are
> > technically considered free space. This is because as free space is
> > used, allocbt blocks are removed and naturally become available for
> > traditional allocation. However, this means that a significant
> > portion of free space may consist of in-use btree blocks if free
> > space is severely fragmented.
> >
> > On large filesystems with large perag reservations, this can lead to
> > a rare but nasty condition where a significant amount of physical
> > free space is available, but the majority of actual usable blocks
> > consist of in-use allocbt blocks. We have a record of a (~12TB, 32
> > AG) filesystem with multiple AGs in a state with ~2.5GB or so free
> > blocks tracked across ~300 total allocbt blocks, but effectively at
> > 100% full because the free space is entirely consumed by
> > refcountbt perag reservation.
> >
> > Such a large perag reservation is by design on large filesystems.
> > The problem is that because the free space is so fragmented, this AG
> > contributes the 300 or so allocbt blocks to the global counters as
> > free space. If this pattern repeats across enough AGs, the
> > filesystem lands in a state where global block reservation can
> > outrun physical block availability. For example, a streaming
> > buffered write on the affected filesystem continues to allow delayed
> > allocation beyond the point where writeback starts to fail due to
> > physical block allocation failures. The expected behavior is for the
> > delalloc block reservation to fail gracefully with -ENOSPC before
> > physical block allocation failure is a possibility.
>
> *nod*
>
> > To address this problem, introduce a percpu counter to track the sum
> > of the allocbt block counters already tracked in the AGF. Use the
> > new counter to set these blocks aside at reservation time and thus
> > ensure they cannot be allocated until truly available. Since this is
> > only necessary when large reflink perag reservations are in place
> > and the counter requires a read of each AGF to fully populate, only
> > enforce on reflink enabled filesystems. This allows initialization
> > of the counter at ->pagf_init time because the refcountbt perag
> > reservation init code reads each AGF at mount time.
>
> Ok, so the mechanism sounds ok, but a per-cpu counter seems like
> premature optimisation. How often are we really updating btree block
> counts? An atomic counter is good for at least a million updates a
> second across a 2 socket 32p machine, and I highly doubt we're
> incrementing/decrementing btree block counts that often on such a
> machine.
>
> While per-cpu counters have a fast write side, they come with
> additional algorithmic complexity. Hence if the update rate of the
> counter is not fast enough to need per-cpu counters, we should avoid
> them. Just because other free space counters use per-cpu counters,
> it doesn't mean everything in that path needs to use them...
>

The use of the percpu counter was more for the read side than the write
side. I think of it more as an abstraction to avoid having to open code
and define a new spin lock just for this. I actually waffled a bit on
just setting a batch count of 0 to get roughly equivalent behavior, but
didn't think it would make much difference.
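FWIW, roughly what I had in mind, purely as a sketch (the m_allocbt_blks
field name and the batch value here are made up for illustration, not
necessarily what the patch ends up with):

	#include <linux/percpu_counter.h>

	/* hypothetical per-mount count of in-use allocbt blocks */
	struct percpu_counter	m_allocbt_blks;

	/* mount time, as each AGF is read for the perag reservation */
	error = percpu_counter_init(&mp->m_allocbt_blks, 0, GFP_KERNEL);

	/*
	 * allocbt block allocated or freed: a small batch (or even 0)
	 * keeps the central count close enough to accurate that the
	 * reservation path never has to lock and sum the per-cpu deltas.
	 */
	percpu_counter_add_batch(&mp->m_allocbt_blks, delta, 4);

With a batch of 0, every update folds straight into the central count
under the counter's internal spinlock, which is the "roughly equivalent
behavior" I was referring to above.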
> > Note that the counter uses a small percpu batch size to allow the
> > allocation paths to keep the primary count accurate enough that the
> > reservation path doesn't ever need to lock and sum the counter.
> > Absolute accuracy is not required here, just that the counter
> > reflects the majority of unavailable blocks so the reservation path
> > fails first.
>
> And this makes the per-cpu counter scale almost no better than a
> simple atomic counter, because a spinlock requires two atomic
> operations (lock and unlock). Hence a batch count of 4 only reduces
> the atomic op count by half but introduces a lot of extra
> complexity. It won't make a difference to the scalability of
> workloads that hammer the btree block count because contention on
> the internal counter spinlock will occur at close to the same
> concurrency rate as would occur on an atomic counter.
>

Right, but percpu_counter_read_positive() allows a fast read in the
xfs_mod_fdblocks() path. I didn't use an atomic because I was concerned
about introducing overhead in that path. If we're Ok with whatever
overhead an atomic read might introduce (a spin lock in the worst case
for some arches), then I don't mind switching over to that. I also
don't mind defining a new spin lock and explicitly implementing the
lockless read in xfs_mod_fdblocks(); I just thought it was extra code
for little benefit over the percpu counter. Preference?

Brian

> Hence a per-cpu counter used in this manner seems like a premature
> optimisation to me...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
>
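P.S. To make the read side trade-off above concrete, a rough sketch of
the two options in the reservation path (the m_allocbt_blks name and the
simplified check are illustrative only, not the actual patch code):

	/* (a) percpu counter: lockless, slightly stale read */
	struct percpu_counter	m_allocbt_blks;

	set_aside = mp->m_alloc_set_aside +
		    percpu_counter_read_positive(&mp->m_allocbt_blks);

	/*
	 * (b) atomic64_t alternative: atomic64_read() is a plain load on
	 * 64-bit arches, but can fall back to a spinlock-protected read
	 * on 32-bit arches without native 64-bit atomics.
	 */
	atomic64_t		m_allocbt_blks;

	set_aside = mp->m_alloc_set_aside +
		    atomic64_read(&mp->m_allocbt_blks);

	/* either way, delalloc reservation fails with -ENOSPC before the
	 * in-use allocbt blocks can be handed out */
	if (fdblocks - delta < set_aside)
		return -ENOSPC;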