Re: [RFC PATCH 1/4] xfs: resurrect the AGFL reservation

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Aug 01, 2024 at 10:53:04AM +1000, Dave Chinner wrote:
> On Mon, Jul 22, 2024 at 11:51:25PM -0700, Krister Johansen wrote:
> > I've got something that sets the mp->m_ag_max_usable and
> > alloc_set_aside() to the values discussed above.  For demonstration
> > purposes, since I was ending up with numbers that were smaller than
> > XFS_ALLOCBT_AGFL_RESERVE + 4, I just picked some slightly different math
> > that inflated these numbers a bit to validate the assumptions.
> > 
> > I'm still able to run AG low enough on space to hit the bug with these
> > modifcations in place, partly because mp->m_ag_max_usable seems to place
> > an upper bound on large contiguous allocations in
> > xfs_alloc_space_available().  E.g. if I allocate in small chunks <
> > m_ag_max_usable, I can still run pagf_freeblks down to about the per-ag
> > reservation.  The test is explicit about only using space from a single
> > AG, which defeats the global reservation checks, since enough space
> > remains free in the other AGs that we don't hit any of the
> > per-filesystem reservation limits.
> > 
> > It seems like we've got the right general concept, but may need to make
> > a few more tweaks to the implementation.
> >
> > In terms of where to go next, I have a few ideas, but would certainly
> > appreciate additional input.
> > 
> > One was to investigate propagating the per-AG set-aside into
> > xfs_alloc_space_available() so that it's only eligible for consumption
> > by a dependent allocation.  (Is the correct way to determine that by
> > checking tp->t_highest_agno != NULLAGNUMBER?)
> 
> Yes, I suspect that we aren't taking into account the correct
> set-aside value when checking if the AG has enough space available
> for the allocation to proceed. i.e. xfs_alloc_space_available()
> isn't getting the reserved space correct - it may be that
> xfs_ag_resv_needed() should be taking into account the set aside
> amount for XFS_AG_RESV_NONE allocations.
> 
> > The other was to determine if I needed add the quantity from
> > xfs_alloc_min_freelist(free space height) to the set-aside.  The reason
> > for this was that if the reserved free space can be indexed by a tree of
> > height 2, then the full split only needs 2 blocks, but once we've split
> > to a tree of that height, xfs_alloc_min_freelist() requires 12 blocks
> > free (with rmap off) to continue the allocation chain, even if we're
> > never going to allocate that many.
> 
> Yes, that is definitely -one- of the problems here.
> 
> > It's that edge case that these tests
> > always manage to hit: run the pagf_freeblks within a block or two of the
> > per-ag reservation, and then split a free-space b-tree right before a
> > BMBT split.  IOW, I need both the blocks for the b-tree expansion and
> > enough to satisfy the xfs_alloc_min_freelist() checks.  If you explained
> > this earlier, and I'm reading the equations too pedantically, my
> > apologies.
> 
> No, I think you're getting deeper into the underlying issues that
> we haven't really solved properly as this code has grown organically
> to solve corner case issue after corner case issue.
> 
> Looking at it, and considering the concept of dependent allocation
> chains and reserve pools for different types of allocations I've
> previously explained, I think the calculation in
> xfs_alloc_min_freelist() is wrong.
> 
> That is, xfs_alloc_min_freelist() looks to assume that all AGFL
> blocks for a free space modification need to be reserved up front
> and placed in the AGFL before we start. Historically speaking, this
> was how allocation used to work. Then we introduced intent based
> operations for freeing extents in 1998 so the BMBT modifications
> were in a separate transaction to the extent freeing. Each now had
> their own AGFL refill requirements in completely decoupled
> transactions.
> 
> Then we added reflink and rmapbt as additional operations that are
> decoupled from the BMBT allocations and transactions by intents.
> They all run in their own allocation contexts and require their own
> independent AGFL fill operations.
> 
> However, xfs_alloc_min_freelist() makes the assumption that the AGFL
> must be large enough to handle both free space btree and rmapbt
> splits for every allocation, even though the space for an rmapbt
> split will only ever be needed when running a XFS_AG_RESV_RMAPBT
> allocation.  IOWs, it is over-calculating the required AGFL size for
> most allocations.
> 
> Further, it is calculating this number for every allocation in a
> transaction, without taking into account the "dependent allocation
> chain" context. This is more difficult to do, because the required
> AGFL size only goes down when a full split is done in the chain.
> 
> I think we actually need to track the number of blocks we may need
> and have consumed for the AGFL within a transaction. We initialise
> that in the first transaction and ensure that we have that amount of
> space remaining in the AG, and then we always check that the
> remaining reservation is greater than the minimum free list size
> required for this allocation (it should always be).
> 
> Whenever we move blocks to/from the AGFL, we account for that in the
> transaction. Hence we have a direct accounting of the AGFL
> consumption across dependent allocations within that AG. When we
> complete one of the dependent allocations in a chain, we can reduce
> the reservation held the transaction by the number of AGFL blocks
> reserved but not consumed by that allocation, thereby allowing us to
> use more the remaining free space the further through the allocation
> chain we get....
> 
> IOWs, we need to track the total AGFL reservation/usage within the
> AG across the transaction, not just individual allocations. We need
> xfs_alloc_min_freelist() to be aware of allocation context and not
> reserve RMAPBT space for allocations that aren't modifying the
> rmapbt. We need xfs_alloc_space_available() to initialise and use
> transaction context based AGFL block reservations for considering
> whether there is space available or not. We need to account for AGFL
> blocks consumed during an AGFL grow operation.
> 
> Lifting the AGFL reservation tracking to the transaction structure
> for dependent allocation chains should alleviate these "but we don't
> know what happened on the previous allocation" and "what might be
> needed for the next allocation" issues. If we do this correctly,
> then we should never need to switch AGs during a dependent
> allocation chain because the first allocation in an will be rejected
> with ENOSPC because there isn't enough space for the dependent chain
> to complete....

I like the idea of moving the AGFL reservation tracking into the
transacation, though I'm not certain I understand all the particulars
yet.

In your mental model for how this might work, is the idea that the space
for the AGFL reservation will be added to the transaction once an AG is
selected for the allocations?  Even with this approach, it will still be
necessary to make some worst-case estimates about how much space might
be used in a transaction chain, correct?  E.g. the worst-case AGFL
equations from earlier in this thread still apply, it's just that the
space would be reserved _once_ at the beginning of the first
transaction.

For the rmapbt allocations, it doesn't look like code calls into the
allocator with XFS_AG_RESV_RMAPBT set.  The code that allocates rmapbt
blocks pulls a blocks from the AGFL and then deducts that from its
per-AG reservation.  I tried walking myself through a few examples, but
it looks like it's possible to do an rmap allocation in a dependent
allocation in a transaction even if the first allocation in the chain is
not allocating a rmap record.

To give a specific example: allocating bmbt blocks associate a rmap
entry with the inode for which the bmbt block is being allocated, but
the first allocation may well set XFS_RMAP_OINFO_SKIP_UPDATE, like
xfs_bmap_btalloc does.

It might be possible to use the oinfo in the xfs_alloc_arg structure to
determine if a particular allocation can allocate a rmap entry.
However, it also looks like allocating blocks to refill the AGFL can
itself allocate these blocks.  Is this a situation where we'd need to
audit the different operations and add specific attributes to their
existing xfs_trans_res to account for whether or not they'd allocate
rmap entries? Or is there a better way to do this that I'm overlooking?

Thanks,

-K




[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux