Re: [GIT PULL] xfs: per-ag centric allocation algorithms

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Thu, 16 Feb 2023 13:43:41 -0800

On Wed, Feb 15, 2023 at 12:11:09PM +1100, Dave Chinner wrote:
> Hi Darrick,
> 
> As you requested yesterday on #xfs, please find below a pull request
> for the allocator rework series. It is unchanged from the last
> version I posted except for updates to RVB tags and a rebase on the
> current linux-xfs/for-next branch. Details of the changes made in
> the series are captured in the tag identifying the branch for you to
> pull.
> 
> PLease let me know if I've screwed anything up in the tree or
> commit metadata and I'll get them sorted ASAP.

Pulled, thanks.

--D

> 
> Cheers,
> 
> Dave.
> 
> ---
> 
> The following changes since commit dd07bb8b6baf2389caff221f043d9188ce6bab8c:
> 
>   xfs: revert commit 8954c44ff477 (2023-02-10 09:06:06 -0800)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs tags/xfs-alloc-perag-conversion
> 
> for you to fetch changes up to bd4f5d09cc93c8ca51e4efea86ac90a4bb553d6e:
> 
>   xfs: refactor the filestreams allocator pick functions (2023-02-13 09:14:56 +1100)
> 
> ----------------------------------------------------------------
> xfs: per-ag centric allocation alogrithms
> 
> This series continues the work towards making shrinking a filesystem
> possible.  We need to be able to stop operations from taking place
> on AGs that need to be removed by a shrink, so before shrink can be
> implemented we need to have the infrastructure in place to prevent
> incursion into AGs that are going to be, or are in the process, of
> being removed from active duty.
> 
> The focus of this is making operations that depend on access to AGs
> use the perag to access and pin the AG in active use, thereby
> creating a barrier we can use to delay shrink until all active uses
> of an AG have been drained and new uses are prevented.
> 
> This series starts by fixing some existing issues that are exposed
> by changes later in the series. They stand alone, so can be picked
> up independently of the rest of this patchset.
> 
> The most complex of these fixes is cleaning up the mess that is the
> AGF deadlock avoidance algorithm. This algorithm stores the first
> block that is allocated in a transaction in tp->t_firstblock, then
> uses this to try to limit future allocations within the transaction
> to AGs at or higher than the filesystem block stored in
> tp->t_firstblock. This depends on one of the initial bug fixes in
> the series to move the deadlock avoidance checks to
> xfs_alloc_vextent(), and then builds on it to relax the constraints
> of the avoidance algorithm to only be active when a deadlock is
> possible.
> 
> We also update the algorithm to record allocations from higher AGs
> that are allocated from, because we when we need to lock more than
> two AGs we still have to ensure lock order is correct. Therefore we
> can't lock AGs in the order 1, 3, 2, even though tp->t_firstblock
> indicates that we've allocated from AG 1 and so AG is valid to lock.
> It's not valid, because we already hold AG 3 locked, and so
> tp->t-first_block should actually point at AG 3, not AG 1 in this
> situation.
> 
> It should now be obvious that the deadlock avoidance algorithm
> should record AGs, not filesystem blocks. So the series then changes
> the transaction to store the highest AG we've allocated in rather
> than a filesystem block we allocated.  This makes it obvious what
> the constraints are, and trivial to update as we lock and allocate
> from various AGs.
> 
> With all the bug fixes out of the way, the series then starts
> converting the code to use active references. Active reference
> counts are used by high level code that needs to prevent the AG from
> being taken out from under it by a shrink operation. The high level
> code needs to be able to handle not getting an active reference
> gracefully, and the shrink code will need to wait for active
> references to drain before continuing.
> 
> Active references are implemented just as reference counts right now
> - an active reference is taken at perag init during mount, and all
> other active references are dependent on the active reference count
> being greater than zero. This gives us an initial method of stopping
> new active references without needing other infrastructure; just
> drop the reference taken at filesystem mount time and when the
> refcount then falls to zero no new references can be taken.
> 
> In future, this will need to take into account AG control state
> (e.g. offline, no alloc, etc) as well as the reference count, but
> right now we can implement a basic barrier for shrink with just
> reference count manipulations. As such, patches to convert the perag
> state to atomic opstate fields similar to the xfs_mount and xlog
> opstate fields follow the initial active perag reference counting
> patches.
> 
> The first target for active reference conversion is the
> for_each_perag*() iterators. This captures a lot of high level code
> that should skip offline AGs, and introduces the ability to
> differentiate between a lookup that didn't have an online AG and the
> end of the AG iteration range.
> 
> From there, the inode allocation AG selection is converted to active
> references, and the perag is driven deeper into the inode allocation
> and btree code to replace the xfs_mount. Most of the inode
> allocation code operates on a single AG once it is selected, hence
> it should pass the perag as the primary referenced object around for
> allocation, not the xfs_mount. There is a bit of churn here, but it
> emphasises that inode allocation is inherently an allocation group
> based operation.
> 
> Next the bmap/alloc interface undergoes a major untangling,
> reworking xfs_bmap_btalloc() into separate allocation operations for
> different contexts and failure handling behaviours. This then allows
> us to completely remove the xfs_alloc_vextent() layer via
> restructuring the xfs_alloc_vextent/xfs_alloc_ag_vextent() into a
> set of realtively simple helper function that describe the
> allocation that they are doing. e.g.  xfs_alloc_vextent_exact_bno().
> 
> This allows the requirements for accessing AGs to be allocation
> context dependent. The allocations that require operation on a
> single AG generally can't tolerate failure after the allocation
> method and AG has been decided on, and hence the caller needs to
> manage the active references to ensure the allocation does not race
> with shrink removing the selected AG for the duration of the
> operation that requires access to that allocation group.
> 
> Other allocations iterate AGs and so the first AG is just a hint -
> these do not need to pin a perag first as they can tolerate not
> being able to access an AG by simply skipping over it. These require
> new perag iteration functions that can start at arbitrary AGs and
> wrap around at arbitrary AGs, hence a new set for
> for_each_perag_wrap*() helpers to do this.
> 
> Next is the rework of the filestreams allocator. This doesn't change
> any functionality, but gets rid of the unnecessary multi-pass
> selection algorithm when the selected AG is not available. It
> currently does a lookup pass which might iterate all AGs to select
> an AG, then checks if the AG is acceptible and if not does a "new
> AG" pass that is essentially identical to the lookup pass. Both of
> these scans also do the same "longest extent in AG" check before
> selecting an AG as is done after the AG is selected.
> 
> IOWs, the filestreams algorithm can be greatly simplified into a
> single new AG selection pass if the there is no current association
> or the currently associated AG doesn't have enough contiguous free
> space for the allocation to proceed.  With this simplification of
> the filestreams allocator, it's then trivial to convert it to use
> for_each_perag_wrap() for the AG scan algorithm.
> 
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> ----------------------------------------------------------------
> Dave Chinner (42):
>       xfs: fix low space alloc deadlock
>       xfs: prefer free inodes at ENOSPC over chunk allocation
>       xfs: block reservation too large for minleft allocation
>       xfs: drop firstblock constraints from allocation setup
>       xfs: t_firstblock is tracking AGs not blocks
>       xfs: don't assert fail on transaction cancel with deferred ops
>       xfs: active perag reference counting
>       xfs: rework the perag trace points to be perag centric
>       xfs: convert xfs_imap() to take a perag
>       xfs: use active perag references for inode allocation
>       xfs: inobt can use perags in many more places than it does
>       xfs: convert xfs_ialloc_next_ag() to an atomic
>       xfs: perags need atomic operational state
>       xfs: introduce xfs_for_each_perag_wrap()
>       xfs: rework xfs_alloc_vextent()
>       xfs: factor xfs_alloc_vextent_this_ag() for  _iterate_ags()
>       xfs: combine __xfs_alloc_vextent_this_ag and  xfs_alloc_ag_vextent
>       xfs: use xfs_alloc_vextent_this_ag() where appropriate
>       xfs: factor xfs_bmap_btalloc()
>       xfs: use xfs_alloc_vextent_first_ag() where appropriate
>       xfs: use xfs_alloc_vextent_start_bno() where appropriate
>       xfs: introduce xfs_alloc_vextent_near_bno()
>       xfs: introduce xfs_alloc_vextent_exact_bno()
>       xfs: introduce xfs_alloc_vextent_prepare()
>       xfs: move allocation accounting to xfs_alloc_vextent_set_fsbno()
>       xfs: fold xfs_alloc_ag_vextent() into callers
>       xfs: move the minimum agno checks into xfs_alloc_vextent_check_args
>       xfs: convert xfs_alloc_vextent_iterate_ags() to use perag walker
>       xfs: convert trim to use for_each_perag_range
>       xfs: factor out filestreams from xfs_bmap_btalloc_nullfb
>       xfs: get rid of notinit from xfs_bmap_longest_free_extent
>       xfs: use xfs_bmap_longest_free_extent() in filestreams
>       xfs: move xfs_bmap_btalloc_filestreams() to xfs_filestreams.c
>       xfs: merge filestream AG lookup into xfs_filestream_select_ag()
>       xfs: merge new filestream AG selection into xfs_filestream_select_ag()
>       xfs: remove xfs_filestream_select_ag() longest extent check
>       xfs: factor out MRU hit case in xfs_filestream_select_ag
>       xfs: track an active perag reference in filestreams
>       xfs: use for_each_perag_wrap in xfs_filestream_pick_ag
>       xfs: pass perag to filestreams tracing
>       xfs: return a referenced perag from filestreams allocator
>       xfs: refactor the filestreams allocator pick functions
> 
>  fs/xfs/libxfs/xfs_ag.c             |  93 ++++-
>  fs/xfs/libxfs/xfs_ag.h             | 111 +++++-
>  fs/xfs/libxfs/xfs_ag_resv.c        |   2 +-
>  fs/xfs/libxfs/xfs_alloc.c          | 683 +++++++++++++++++++++++--------------
>  fs/xfs/libxfs/xfs_alloc.h          |  61 ++--
>  fs/xfs/libxfs/xfs_alloc_btree.c    |   2 +-
>  fs/xfs/libxfs/xfs_bmap.c           | 664 +++++++++++++++++------------------
>  fs/xfs/libxfs/xfs_bmap.h           |   7 +
>  fs/xfs/libxfs/xfs_bmap_btree.c     |  64 ++--
>  fs/xfs/libxfs/xfs_btree.c          |   2 +-
>  fs/xfs/libxfs/xfs_ialloc.c         | 241 ++++++-------
>  fs/xfs/libxfs/xfs_ialloc.h         |   5 +-
>  fs/xfs/libxfs/xfs_ialloc_btree.c   |  47 ++-
>  fs/xfs/libxfs/xfs_ialloc_btree.h   |  20 +-
>  fs/xfs/libxfs/xfs_refcount_btree.c |  10 +-
>  fs/xfs/libxfs/xfs_rmap_btree.c     |   2 +-
>  fs/xfs/libxfs/xfs_sb.c             |   3 +-
>  fs/xfs/scrub/agheader_repair.c     |  35 +-
>  fs/xfs/scrub/bmap.c                |   2 +-
>  fs/xfs/scrub/common.c              |  21 +-
>  fs/xfs/scrub/fscounters.c          |  13 +-
>  fs/xfs/scrub/repair.c              |   7 +-
>  fs/xfs/xfs_bmap_util.c             |   2 +-
>  fs/xfs/xfs_discard.c               |  50 ++-
>  fs/xfs/xfs_filestream.c            | 455 ++++++++++++------------
>  fs/xfs/xfs_filestream.h            |   6 +-
>  fs/xfs/xfs_fsmap.c                 |   4 +-
>  fs/xfs/xfs_icache.c                |   8 +-
>  fs/xfs/xfs_inode.c                 |   2 +-
>  fs/xfs/xfs_iwalk.c                 |  10 +-
>  fs/xfs/xfs_mount.h                 |   3 +-
>  fs/xfs/xfs_reflink.c               |   4 +-
>  fs/xfs/xfs_super.c                 |  47 +--
>  fs/xfs/xfs_trace.h                 |  81 ++---
>  fs/xfs/xfs_trans.c                 |   8 +-
>  fs/xfs/xfs_trans.h                 |   2 +-
>  36 files changed, 1534 insertions(+), 1243 deletions(-)
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx