Re: [PATCH 48/63] xfs: preallocate blocks for worst-case btree expansion

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Oct 12, 2016 at 02:44:51PM -0400, Brian Foster wrote:
> On Thu, Sep 29, 2016 at 08:10:52PM -0700, Darrick J. Wong wrote:
> > To gracefully handle the situation where a CoW operation turns a
> > single refcount extent into a lot of tiny ones and then run out of
> > space when a tree split has to happen, use the per-AG reserved block
> > pool to pre-allocate all the space we'll ever need for a maximal
> > btree.  For a 4K block size, this only costs an overhead of 0.3% of
> > available disk space.
> > 
> > When reflink is enabled, we have an unfortunate problem with rmap --
> > since we can share a block billions of times, this means that the
> > reverse mapping btree can expand basically infinitely.  When an AG is
> > so full that there are no free blocks with which to expand the rmapbt,
> > the filesystem will shut down hard.
> > 
> > This is rather annoying to the user, so use the AG reservation code to
> > reserve a "reasonable" amount of space for rmap.  We'll prevent
> > reflinks and CoW operations if we think we're getting close to
> > exhausting an AG's free space rather than shutting down, but this
> > permanent reservation should be enough for "most" users.  Hopefully.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > [hch@xxxxxx: ensure that we invalidate the freed btree buffer]
> > Signed-off-by: Christoph Hellwig <hch@xxxxxx>
> > ---
> > v2: Simplify the return value from xfs_perag_pool_free_block to a bool
> > so that we can easily call xfs_trans_binval for both the per-AG pool
> > and the real freeing case.  Without this we fail to invalidate the
> > btree buffer and will trip over the write verifier on a shrinking
> > refcount btree.
> > 
> > v3: Convert to the new per-AG reservation code.
> > 
> > v4: Combine this patch with the one that adds the rmapbt reservation,
> > since the rmapbt reservation is only needed for reflink filesystems.
> > 
> > v5: If we detect errors while counting the refcount or rmap btrees,
> > shut down the filesystem to avoid the scenario where the fs shuts down
> > mid-transaction due to btree corruption, repair refuses to run until
> > the log is clean, and the log cannot be cleaned because replay hits
> > btree corruption and shuts down.
> > ---
> >  fs/xfs/libxfs/xfs_ag_resv.c        |   11 ++++++
> >  fs/xfs/libxfs/xfs_refcount_btree.c |   45 ++++++++++++++++++++++++-
> >  fs/xfs/libxfs/xfs_refcount_btree.h |    3 ++
> >  fs/xfs/libxfs/xfs_rmap_btree.c     |   60 ++++++++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_rmap_btree.h     |    7 ++++
> >  fs/xfs/xfs_fsops.c                 |   64 ++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_fsops.h                 |    3 ++
> >  fs/xfs/xfs_mount.c                 |    8 +++++
> >  fs/xfs/xfs_super.c                 |   12 +++++++
> >  9 files changed, 210 insertions(+), 3 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> > index e3ae0f2..adf770f 100644
> > --- a/fs/xfs/libxfs/xfs_ag_resv.c
> > +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> > @@ -38,6 +38,7 @@
> >  #include "xfs_trans_space.h"
> >  #include "xfs_rmap_btree.h"
> >  #include "xfs_btree.h"
> > +#include "xfs_refcount_btree.h"
> >  
> >  /*
> >   * Per-AG Block Reservations
> > @@ -228,6 +229,11 @@ xfs_ag_resv_init(
> >  	if (pag->pag_meta_resv.ar_asked == 0) {
> >  		ask = used = 0;
> >  
> > +		error = xfs_refcountbt_calc_reserves(pag->pag_mount,
> > +				pag->pag_agno, &ask, &used);
> > +		if (error)
> > +			goto out;
> > +
> >  		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
> >  				ask, used);
> 
> Now that I get here, I see we have these per-ag reservation structures
> and whatnot, but __xfs_ag_resv_init() (from a previous patch) calls
> xfs_mod_fdblocks() for the reservation. AFAICT, that reserves from the
> "global pool." Based on the commit log, isn't the intent here to reserve
> blocks within each AG? What am I missing?

The AG reservation code "reserves" blocks in each AG by hiding them from
the allocator.  They're all still there in the bnobt, but we underreport
the length of the longest free extent and the free block count in that
AG to make it look like there's less free space than there is.  Since
those blocks are no longer generally available, we have to decrease the
in-core free block count so we can't create delalloc reservations that
the allocator won't (or can't) satisfy.

Maybe a more concrete way to put that is: say we have 4 AGs with 4 agresv
blocks each, and no other free space left anywhere.  The in-core fdblocks count
should be 0 so that starting a write into a hole returns ENOSPC even if the
write could be done without any btree shape changes.   Otherwise, writepages
tries to allocate the delalloc reservation, fails to find any space because
we've hidden it, and kaboom.

--D

> 
> Brian
> 
> >  		if (error)
> > @@ -238,6 +244,11 @@ xfs_ag_resv_init(
> >  	if (pag->pag_agfl_resv.ar_asked == 0) {
> >  		ask = used = 0;
> >  
> > +		error = xfs_rmapbt_calc_reserves(pag->pag_mount, pag->pag_agno,
> > +				&ask, &used);
> > +		if (error)
> > +			goto out;
> > +
> >  		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_AGFL, ask, used);
> >  		if (error)
> >  			goto out;
> > diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
> > index 6b5e82b9..453bb27 100644
> > --- a/fs/xfs/libxfs/xfs_refcount_btree.c
> > +++ b/fs/xfs/libxfs/xfs_refcount_btree.c
> > @@ -79,6 +79,8 @@ xfs_refcountbt_alloc_block(
> >  	struct xfs_alloc_arg	args;		/* block allocation args */
> >  	int			error;		/* error return value */
> >  
> > +	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> > +
> >  	memset(&args, 0, sizeof(args));
> >  	args.tp = cur->bc_tp;
> >  	args.mp = cur->bc_mp;
> > @@ -88,6 +90,7 @@ xfs_refcountbt_alloc_block(
> >  	args.firstblock = args.fsbno;
> >  	xfs_rmap_ag_owner(&args.oinfo, XFS_RMAP_OWN_REFC);
> >  	args.minlen = args.maxlen = args.prod = 1;
> > +	args.resv = XFS_AG_RESV_METADATA;
> >  
> >  	error = xfs_alloc_vextent(&args);
> >  	if (error)
> > @@ -125,16 +128,19 @@ xfs_refcountbt_free_block(
> >  	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
> >  	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
> >  	struct xfs_owner_info	oinfo;
> > +	int			error;
> >  
> >  	trace_xfs_refcountbt_free_block(cur->bc_mp, cur->bc_private.a.agno,
> >  			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1);
> >  	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_REFC);
> >  	be32_add_cpu(&agf->agf_refcount_blocks, -1);
> >  	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
> > -	xfs_bmap_add_free(mp, cur->bc_private.a.dfops, fsbno, 1,
> > -			&oinfo);
> > +	error = xfs_free_extent(cur->bc_tp, fsbno, 1, &oinfo,
> > +			XFS_AG_RESV_METADATA);
> > +	if (error)
> > +		return error;
> >  
> > -	return 0;
> > +	return error;
> >  }
> >  
> >  STATIC int
> > @@ -410,3 +416,36 @@ xfs_refcountbt_max_size(
> >  
> >  	return xfs_refcountbt_calc_size(mp, mp->m_sb.sb_agblocks);
> >  }
> > +
> > +/*
> > + * Figure out how many blocks to reserve and how many are used by this btree.
> > + */
> > +int
> > +xfs_refcountbt_calc_reserves(
> > +	struct xfs_mount	*mp,
> > +	xfs_agnumber_t		agno,
> > +	xfs_extlen_t		*ask,
> > +	xfs_extlen_t		*used)
> > +{
> > +	struct xfs_buf		*agbp;
> > +	struct xfs_agf		*agf;
> > +	xfs_extlen_t		tree_len;
> > +	int			error;
> > +
> > +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> > +		return 0;
> > +
> > +	*ask += xfs_refcountbt_max_size(mp);
> > +
> > +	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
> > +	if (error)
> > +		return error;
> > +
> > +	agf = XFS_BUF_TO_AGF(agbp);
> > +	tree_len = be32_to_cpu(agf->agf_refcount_blocks);
> > +	xfs_buf_relse(agbp);
> > +
> > +	*used += tree_len;
> > +
> > +	return error;
> > +}
> > diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
> > index 780b02f..3be7768 100644
> > --- a/fs/xfs/libxfs/xfs_refcount_btree.h
> > +++ b/fs/xfs/libxfs/xfs_refcount_btree.h
> > @@ -68,4 +68,7 @@ extern xfs_extlen_t xfs_refcountbt_calc_size(struct xfs_mount *mp,
> >  		unsigned long long len);
> >  extern xfs_extlen_t xfs_refcountbt_max_size(struct xfs_mount *mp);
> >  
> > +extern int xfs_refcountbt_calc_reserves(struct xfs_mount *mp,
> > +		xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
> > +
> >  #endif	/* __XFS_REFCOUNT_BTREE_H__ */
> > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> > index 9c0585e..83e672f 100644
> > --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> > +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> > @@ -35,6 +35,7 @@
> >  #include "xfs_cksum.h"
> >  #include "xfs_error.h"
> >  #include "xfs_extent_busy.h"
> > +#include "xfs_ag_resv.h"
> >  
> >  /*
> >   * Reverse map btree.
> > @@ -533,3 +534,62 @@ xfs_rmapbt_compute_maxlevels(
> >  		mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(mp,
> >  				mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
> >  }
> > +
> > +/* Calculate the refcount btree size for some records. */
> > +xfs_extlen_t
> > +xfs_rmapbt_calc_size(
> > +	struct xfs_mount	*mp,
> > +	unsigned long long	len)
> > +{
> > +	return xfs_btree_calc_size(mp, mp->m_rmap_mnr, len);
> > +}
> > +
> > +/*
> > + * Calculate the maximum refcount btree size.
> > + */
> > +xfs_extlen_t
> > +xfs_rmapbt_max_size(
> > +	struct xfs_mount	*mp)
> > +{
> > +	/* Bail out if we're uninitialized, which can happen in mkfs. */
> > +	if (mp->m_rmap_mxr[0] == 0)
> > +		return 0;
> > +
> > +	return xfs_rmapbt_calc_size(mp, mp->m_sb.sb_agblocks);
> > +}
> > +
> > +/*
> > + * Figure out how many blocks to reserve and how many are used by this btree.
> > + */
> > +int
> > +xfs_rmapbt_calc_reserves(
> > +	struct xfs_mount	*mp,
> > +	xfs_agnumber_t		agno,
> > +	xfs_extlen_t		*ask,
> > +	xfs_extlen_t		*used)
> > +{
> > +	struct xfs_buf		*agbp;
> > +	struct xfs_agf		*agf;
> > +	xfs_extlen_t		pool_len;
> > +	xfs_extlen_t		tree_len;
> > +	int			error;
> > +
> > +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> > +		return 0;
> > +
> > +	/* Reserve 1% of the AG or enough for 1 block per record. */
> > +	pool_len = max(mp->m_sb.sb_agblocks / 100, xfs_rmapbt_max_size(mp));
> > +	*ask += pool_len;
> > +
> > +	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
> > +	if (error)
> > +		return error;
> > +
> > +	agf = XFS_BUF_TO_AGF(agbp);
> > +	tree_len = be32_to_cpu(agf->agf_rmap_blocks);
> > +	xfs_buf_relse(agbp);
> > +
> > +	*used += tree_len;
> > +
> > +	return error;
> > +}
> > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> > index e73a553..2a9ac47 100644
> > --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> > +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> > @@ -58,4 +58,11 @@ struct xfs_btree_cur *xfs_rmapbt_init_cursor(struct xfs_mount *mp,
> >  int xfs_rmapbt_maxrecs(struct xfs_mount *mp, int blocklen, int leaf);
> >  extern void xfs_rmapbt_compute_maxlevels(struct xfs_mount *mp);
> >  
> > +extern xfs_extlen_t xfs_rmapbt_calc_size(struct xfs_mount *mp,
> > +		unsigned long long len);
> > +extern xfs_extlen_t xfs_rmapbt_max_size(struct xfs_mount *mp);
> > +
> > +extern int xfs_rmapbt_calc_reserves(struct xfs_mount *mp,
> > +		xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
> > +
> >  #endif	/* __XFS_RMAP_BTREE_H__ */
> > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > index 3acbf4e0..93d12fa 100644
> > --- a/fs/xfs/xfs_fsops.c
> > +++ b/fs/xfs/xfs_fsops.c
> > @@ -43,6 +43,7 @@
> >  #include "xfs_log.h"
> >  #include "xfs_filestream.h"
> >  #include "xfs_rmap.h"
> > +#include "xfs_ag_resv.h"
> >  
> >  /*
> >   * File system operations
> > @@ -630,6 +631,11 @@ xfs_growfs_data_private(
> >  	xfs_set_low_space_thresholds(mp);
> >  	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> >  
> > +	/* Reserve AG metadata blocks. */
> > +	error = xfs_fs_reserve_ag_blocks(mp);
> > +	if (error && error != -ENOSPC)
> > +		goto out;
> > +
> >  	/* update secondary superblocks. */
> >  	for (agno = 1; agno < nagcount; agno++) {
> >  		error = 0;
> > @@ -680,6 +686,8 @@ xfs_growfs_data_private(
> >  			continue;
> >  		}
> >  	}
> > +
> > + out:
> >  	return saved_error ? saved_error : error;
> >  
> >   error0:
> > @@ -989,3 +997,59 @@ xfs_do_force_shutdown(
> >  	"Please umount the filesystem and rectify the problem(s)");
> >  	}
> >  }
> > +
> > +/*
> > + * Reserve free space for per-AG metadata.
> > + */
> > +int
> > +xfs_fs_reserve_ag_blocks(
> > +	struct xfs_mount	*mp)
> > +{
> > +	xfs_agnumber_t		agno;
> > +	struct xfs_perag	*pag;
> > +	int			error = 0;
> > +	int			err2;
> > +
> > +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> > +		pag = xfs_perag_get(mp, agno);
> > +		err2 = xfs_ag_resv_init(pag);
> > +		xfs_perag_put(pag);
> > +		if (err2 && !error)
> > +			error = err2;
> > +	}
> > +
> > +	if (error && error != -ENOSPC) {
> > +		xfs_warn(mp,
> > +	"Error %d reserving per-AG metadata reserve pool.", error);
> > +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > +	}
> > +
> > +	return error;
> > +}
> > +
> > +/*
> > + * Free space reserved for per-AG metadata.
> > + */
> > +int
> > +xfs_fs_unreserve_ag_blocks(
> > +	struct xfs_mount	*mp)
> > +{
> > +	xfs_agnumber_t		agno;
> > +	struct xfs_perag	*pag;
> > +	int			error = 0;
> > +	int			err2;
> > +
> > +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> > +		pag = xfs_perag_get(mp, agno);
> > +		err2 = xfs_ag_resv_free(pag);
> > +		xfs_perag_put(pag);
> > +		if (err2 && !error)
> > +			error = err2;
> > +	}
> > +
> > +	if (error)
> > +		xfs_warn(mp,
> > +	"Error %d freeing per-AG metadata reserve pool.", error);
> > +
> > +	return error;
> > +}
> > diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
> > index f32713f..f349158 100644
> > --- a/fs/xfs/xfs_fsops.h
> > +++ b/fs/xfs/xfs_fsops.h
> > @@ -26,4 +26,7 @@ extern int xfs_reserve_blocks(xfs_mount_t *mp, __uint64_t *inval,
> >  				xfs_fsop_resblks_t *outval);
> >  extern int xfs_fs_goingdown(xfs_mount_t *mp, __uint32_t inflags);
> >  
> > +extern int xfs_fs_reserve_ag_blocks(struct xfs_mount *mp);
> > +extern int xfs_fs_unreserve_ag_blocks(struct xfs_mount *mp);
> > +
> >  #endif	/* __XFS_FSOPS_H__ */
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index caecbd2..b5da81d 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -986,10 +986,17 @@ xfs_mountfs(
> >  			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> >  			goto out_quota;
> >  		}
> > +
> > +		/* Reserve AG blocks for future btree expansion. */
> > +		error = xfs_fs_reserve_ag_blocks(mp);
> > +		if (error && error != -ENOSPC)
> > +			goto out_agresv;
> >  	}
> >  
> >  	return 0;
> >  
> > + out_agresv:
> > +	xfs_fs_unreserve_ag_blocks(mp);
> >   out_quota:
> >  	xfs_qm_unmount_quotas(mp);
> >   out_rtunmount:
> > @@ -1034,6 +1041,7 @@ xfs_unmountfs(
> >  
> >  	cancel_delayed_work_sync(&mp->m_eofblocks_work);
> >  
> > +	xfs_fs_unreserve_ag_blocks(mp);
> >  	xfs_qm_unmount_quotas(mp);
> >  	xfs_rtunmount_inodes(mp);
> >  	IRELE(mp->m_rootip);
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index e6aaa91..875ab9f 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -1315,10 +1315,22 @@ xfs_fs_remount(
> >  			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> >  			return error;
> >  		}
> > +
> > +		/* Create the per-AG metadata reservation pool .*/
> > +		error = xfs_fs_reserve_ag_blocks(mp);
> > +		if (error && error != -ENOSPC)
> > +			return error;
> >  	}
> >  
> >  	/* rw -> ro */
> >  	if (!(mp->m_flags & XFS_MOUNT_RDONLY) && (*flags & MS_RDONLY)) {
> > +		/* Free the per-AG metadata reservation pool. */
> > +		error = xfs_fs_unreserve_ag_blocks(mp);
> > +		if (error) {
> > +			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > +			return error;
> > +		}
> > +
> >  		/*
> >  		 * Before we sync the metadata, we need to free up the reserve
> >  		 * block pool so that the used block count in the superblock on
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux