[RFC PATCH 0/9] xfs: push perags further into allocation routines

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 4 Oct 2023 11:19:34 +1100

This series continues the work towards making shrinking a filesystem
possible.  We need to be able to stop operations from taking place
on AGs that need to be removed by a shrink, so before shrink can be
implemented we need to have the infrastructure in place to prevent
incursion into AGs that are going to be, or are in the process of
being, removed from active duty.

The focus of this is making operations that depend on access to AGs
use the perag to access and pin the AG in active use, thereby
creating a barrier we can use to delay shrink until all active uses
of an AG have been drained and new uses are prevented.
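
As a rough illustration of that barrier, here is a minimal userspace
sketch of an "increment only if non-zero" active reference. All the
demo_* names are hypothetical and this is not the kernel code; it
only shows the shape of the idea, and it only refuses new users once
the count has actually reached zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a per-AG structure; not struct xfs_perag. */
struct demo_perag {
	atomic_int	active_ref;	/* 1 base ref while online, +1 per user */
};

static void demo_perag_init(struct demo_perag *pag)
{
	atomic_init(&pag->active_ref, 1);
}

/* Pin the AG for active use. Fails once the count has dropped to zero. */
static bool demo_perag_grab(struct demo_perag *pag)
{
	int old = atomic_load(&pag->active_ref);

	/* Only take a reference if somebody else still holds one. */
	while (old > 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&pag->active_ref, &old,
						 old + 1))
			return true;
	}
	return false;
}

/* Drop a reference. Returns true if this was the last one. */
static bool demo_perag_rele(struct demo_perag *pag)
{
	int old = atomic_fetch_sub(&pag->active_ref, 1);

	assert(old > 0);
	return old == 1;
}

/*
 * Shrink side: drop the base reference. Returns true if the AG is
 * already drained, false if active users still have it pinned.
 */
static bool demo_perag_offline(struct demo_perag *pag)
{
	return demo_perag_rele(pag);
}
```

In this sketch a shrink would call demo_perag_offline() and then wait
for the last demo_perag_rele() to return true before touching the AG.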

The previous round of allocator changes pushed perags mostly into
the core of the allocator, but it left some rough edges where
allocation routines may or may not be called with perags already
held. This series continues the work on driving the perag further
outwards into the bmap and individual allocation layers to clean up
these warts.

The bmap allocators have some interesting complexities. For example,
they might attempt exact block allocation before attempting aligned
allocation, and then in some cases want to attempt aligned
allocation anywhere in the filesystem instead of in the same AG as
they do in other cases. Hence the code is somewhat complex as it
tries to handle all these different cases.

The first step in untangling it all is splitting the exact block EOF
case away from aligned allocation. If that fails, then we can
attempt a near block aligned allocation. The filestreams allocator
already does this with an attempt to allocate only in the same AG as
the EOF block, but the normal allocator does an "all AG near block"
scan. This latter case starts with a "near block in the same AG"
pass, then tries any other AG, but it requires dropping the perag
before we start, hence doesn't provide any guarantee that we can
actually get the same start AG again....

This separation then exposes the cases where we should be doing
aligned allocation but don't, or where we attempt aligned allocation
when we know it can't succeed. There are several small changes to
take this sort of thing into account when selecting the initial AG
to allocate from.

With that, we then push the perag management out into the initial AG
selection code, thereby guaranteeing that we hold the selected AG
until we've failed all the AG-specific allocation attempts the
policy defines.

Given that we now largely guarantee we select an AG with enough
space for the initial aligned allocation, there is no longer a need
to do an "all AGs" aligned allocation attempt. We know it can be done
in the selected AG, so failure should be very rare and this allows
us to use the same initial single AG EOF/aligned allocation logic
for both allocation policies.

This then allows us to move to {agno,agbno} based allocation
targets, rather than fsblock based targets. It also means that we
always call xfs_alloc_vextent_exact_bno() with a perag held, so we
can get rid of the conditional perag code in that function. This
makes _near_bno() and _exact_bno() essentially identical except for
the allocation function they call, so they can be collapsed into a
common helper.
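
To illustrate the shape of that collapse, here is a small userspace
sketch. Everything named demo_* is hypothetical, including the toy
one-flag-per-block AG, and it is not the code in the series. The two
entry points differ only in the AG allocation function they pass to
the shared helper.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_AG_BLOCKS	8

/* Hypothetical toy AG: one flag per block, true means in use. */
struct demo_ag {
	bool		used[DEMO_AG_BLOCKS];
};

struct demo_args {
	struct demo_ag	*ag;		/* AG already held by the caller */
	uint32_t	target;		/* requested agbno */
	uint32_t	agbno;		/* result: allocated agbno */
};

typedef int (*demo_ag_alloc_fn)(struct demo_args *args);

/* Allocate the target block or fail. */
static int demo_ag_alloc_exact(struct demo_args *args)
{
	if (args->ag->used[args->target])
		return -ENOSPC;
	args->ag->used[args->target] = true;
	args->agbno = args->target;
	return 0;
}

/* Allocate the free block closest to the target, searching outwards. */
static int demo_ag_alloc_near(struct demo_args *args)
{
	for (uint32_t dist = 0; dist < DEMO_AG_BLOCKS; dist++) {
		uint32_t cand[2] = { args->target + dist,
				     args->target - dist };

		for (int i = 0; i < 2; i++) {
			/* Unsigned wrap on underflow fails this check too. */
			if (cand[i] >= DEMO_AG_BLOCKS ||
			    args->ag->used[cand[i]])
				continue;
			args->ag->used[cand[i]] = true;
			args->agbno = cand[i];
			return 0;
		}
	}
	return -ENOSPC;
}

/* Common helper: the caller always supplies a held AG. */
static int demo_alloc_at_bno(struct demo_ag *ag, uint32_t target,
			     demo_ag_alloc_fn alloc, uint32_t *agbno)
{
	struct demo_args args = { .ag = ag, .target = target };
	int error;

	assert(agbno != NULL);
	if (!ag || target >= DEMO_AG_BLOCKS)
		return -EINVAL;

	error = alloc(&args);
	if (error)
		return error;
	*agbno = args.agbno;
	return 0;
}

static int demo_alloc_exact_bno(struct demo_ag *ag, uint32_t target,
				uint32_t *agbno)
{
	return demo_alloc_at_bno(ag, target, demo_ag_alloc_exact, agbno);
}

static int demo_alloc_near_bno(struct demo_ag *ag, uint32_t target,
			       uint32_t *agbno)
{
	return demo_alloc_at_bno(ag, target, demo_ag_alloc_near, agbno);
}
```

Because the helper never has to look up or drop an AG itself, there
is no conditional reference handling left to get wrong in this toy
version.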

And with all this, we now have the APIs simplified to the point
where we can change how allocation failure is signalled. Rather than
having the internal AG allocators returning success with
args->agbno == NULLAGBLOCK to indicate ENOSPC, and then having to
convert that to returning success with args->fsblock == NULLFSBLOCK
to indicate allocation failure to the higher layers, we can convert
all the code to return -ENOSPC when allocation failure occurs.
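
The difference between the two conventions can be sketched in plain
userspace C as follows. The demo_* names, the sentinel value and the
toy bump allocator are all assumptions made for illustration; they
are not the XFS structures or functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NULLAGBLOCK	((uint32_t)-1)	/* hypothetical sentinel */

/* Hypothetical toy allocation arguments; not struct xfs_alloc_arg. */
struct demo_args {
	uint32_t	next_free;	/* next unallocated block in the toy AG */
	uint32_t	nr_free;	/* blocks left in the toy AG */
	uint32_t	len;		/* blocks requested */
	uint32_t	agbno;		/* result: first allocated block */
};

/*
 * Sentinel convention: running out of space still returns 0, so
 * every caller must remember to check agbno for the NULL value.
 */
static int demo_alloc_sentinel(struct demo_args *args)
{
	assert(args != NULL);
	if (args->len > args->nr_free) {
		args->agbno = DEMO_NULLAGBLOCK;
		return 0;	/* "success", but nothing was allocated */
	}
	args->agbno = args->next_free;
	args->next_free += args->len;
	args->nr_free -= args->len;
	return 0;
}

/*
 * Error convention: running out of space is an error, so a caller
 * that only checks the return value still notices the failure.
 */
static int demo_alloc_errno(struct demo_args *args)
{
	assert(args != NULL);
	if (args->len > args->nr_free)
		return -ENOSPC;
	args->agbno = args->next_free;
	args->next_free += args->len;
	args->nr_free -= args->len;
	return 0;
}
```

A caller that forgets the sentinel check in the first variant carries
on with a bogus block number, whereas the second variant propagates
an error through the usual return-value checks.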

This is intended to avoid the problems inherent in detecting the
"successful allocation that failed" situations that led to the data
corruption problem in the 6.3 cycle - if we fail to catch ENOSPC
explicitly now, the allocation will still return an error and fail.
Such a failure will likely result in a filesystem shutdown, which is
a *much* better failure behaviour than writing data to a random
location on the block device....

The end result is a slightly more efficient allocation path that
selects AGs at the highest level possible for initial allocation
attempts, uses ENOSPC errors to detect allocation failures, and only
uses AG iteration based allocation algorithms in the cases where the
initial targeted allocations fail. It also makes it much clearer
where we are doing stripe aligned allocations versus non-aligned
allocations.

This passes fstests and various data tests (e.g. fio), but hasn't
been strenuously tested yet. I'm posting it because of the
forced-align functionality that has been talked about, and this
series makes it quite clear what "aligned allocation" currently
means and how that is quite different to what "forced-align" is
intended to mean.