Re: [PATCH 48/63] xfs: preallocate blocks for worst-case btree expansion

On Thu, Sep 29, 2016 at 08:10:52PM -0700, Darrick J. Wong wrote:
> To gracefully handle the situation where a CoW operation turns a
> single refcount extent into a lot of tiny ones and then runs out of
> space when a tree split has to happen, use the per-AG reserved block
> pool to pre-allocate all the space we'll ever need for a maximal
> btree.  For a 4K block size, this only costs an overhead of 0.3% of
> available disk space.
> 
> When reflink is enabled, we have an unfortunate problem with rmap --
> since we can share a block billions of times, this means that the
> reverse mapping btree can expand basically infinitely.  When an AG is
> so full that there are no free blocks with which to expand the rmapbt,
> the filesystem will shut down hard.
> 
> This is rather annoying to the user, so use the AG reservation code to
> reserve a "reasonable" amount of space for rmap.  We'll prevent
> reflinks and CoW operations if we think we're getting close to
> exhausting an AG's free space rather than shutting down, but this
> permanent reservation should be enough for "most" users.  Hopefully.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> [hch@xxxxxx: ensure that we invalidate the freed btree buffer]
> Signed-off-by: Christoph Hellwig <hch@xxxxxx>
> ---
> v2: Simplify the return value from xfs_perag_pool_free_block to a bool
> so that we can easily call xfs_trans_binval for both the per-AG pool
> and the real freeing case.  Without this we fail to invalidate the
> btree buffer and will trip over the write verifier on a shrinking
> refcount btree.
> 
> v3: Convert to the new per-AG reservation code.
> 
> v4: Combine this patch with the one that adds the rmapbt reservation,
> since the rmapbt reservation is only needed for reflink filesystems.
> 
> v5: If we detect errors while counting the refcount or rmap btrees,
> shut down the filesystem to avoid the scenario where the fs shuts down
> mid-transaction due to btree corruption, repair refuses to run until
> the log is clean, and the log cannot be cleaned because replay hits
> btree corruption and shuts down.
> ---
>  fs/xfs/libxfs/xfs_ag_resv.c        |   11 ++++++
>  fs/xfs/libxfs/xfs_refcount_btree.c |   45 ++++++++++++++++++++++++-
>  fs/xfs/libxfs/xfs_refcount_btree.h |    3 ++
>  fs/xfs/libxfs/xfs_rmap_btree.c     |   60 ++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_rmap_btree.h     |    7 ++++
>  fs/xfs/xfs_fsops.c                 |   64 ++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_fsops.h                 |    3 ++
>  fs/xfs/xfs_mount.c                 |    8 +++++
>  fs/xfs/xfs_super.c                 |   12 +++++++
>  9 files changed, 210 insertions(+), 3 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> index e3ae0f2..adf770f 100644
> --- a/fs/xfs/libxfs/xfs_ag_resv.c
> +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> @@ -38,6 +38,7 @@
>  #include "xfs_trans_space.h"
>  #include "xfs_rmap_btree.h"
>  #include "xfs_btree.h"
> +#include "xfs_refcount_btree.h"
>  
>  /*
>   * Per-AG Block Reservations
> @@ -228,6 +229,11 @@ xfs_ag_resv_init(
>  	if (pag->pag_meta_resv.ar_asked == 0) {
>  		ask = used = 0;
>  
> +		error = xfs_refcountbt_calc_reserves(pag->pag_mount,
> +				pag->pag_agno, &ask, &used);
> +		if (error)
> +			goto out;
> +
>  		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
>  				ask, used);

Now that I get here, I see we have these per-ag reservation structures
and whatnot, but __xfs_ag_resv_init() (from a previous patch) calls
xfs_mod_fdblocks() for the reservation. AFAICT, that reserves from the
"global pool." Based on the commit log, isn't the intent here to reserve
blocks within each AG? What am I missing?
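
To make the question concrete, here is a toy userspace model of the case I
have in mind. It is only a sketch: struct toy_fs and the toy_* helpers are
invented for illustration and are not the XFS structures, nor a claim about
what __xfs_ag_resv_init() or the allocator actually do. It assumes the
reservation is charged only against one global free counter and that
allocations check only that counter.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_AGS 2

/* Toy model with hypothetical names; not the real XFS structures. */
struct toy_fs {
	long fdblocks;		/* global free block counter */
	long ag_free[NR_AGS];	/* blocks physically free in each AG */
	long ag_resv[NR_AGS];	/* blocks each AG wants held back */
};

/* Charge an AG's reservation against the global counter only. */
static bool toy_resv_init(struct toy_fs *fs, int agno, long ask)
{
	assert(agno >= 0 && agno < NR_AGS);
	if (fs->fdblocks < ask)
		return false;
	fs->fdblocks -= ask;
	fs->ag_resv[agno] = ask;
	return true;
}

/* Allocate from one AG, consulting only the global counter (assumption). */
static bool toy_alloc_global_only(struct toy_fs *fs, int agno, long len)
{
	assert(agno >= 0 && agno < NR_AGS);
	if (fs->fdblocks < len || fs->ag_free[agno] < len)
		return false;
	fs->fdblocks -= len;
	fs->ag_free[agno] -= len;
	return true;
}

/* How much of this AG's reservation is still backed by free blocks. */
static long toy_resv_backed(const struct toy_fs *fs, int agno)
{
	assert(agno >= 0 && agno < NR_AGS);
	return fs->ag_free[agno] < fs->ag_resv[agno] ?
		fs->ag_free[agno] : fs->ag_resv[agno];
}
```

With two AGs of 100 free blocks each (fdblocks = 200), reserving 50 per AG
leaves fdblocks at 100. A 100-block allocation placed entirely in AG 0 then
passes the global check and leaves AG 0 with nothing to back its 50-block
reservation. Whether the per-AG structures prevent this in the real code is
the part I can't see from this patch.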

Brian

>  		if (error)
> @@ -238,6 +244,11 @@ xfs_ag_resv_init(
>  	if (pag->pag_agfl_resv.ar_asked == 0) {
>  		ask = used = 0;
>  
> +		error = xfs_rmapbt_calc_reserves(pag->pag_mount, pag->pag_agno,
> +				&ask, &used);
> +		if (error)
> +			goto out;
> +
>  		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_AGFL, ask, used);
>  		if (error)
>  			goto out;
> diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
> index 6b5e82b9..453bb27 100644
> --- a/fs/xfs/libxfs/xfs_refcount_btree.c
> +++ b/fs/xfs/libxfs/xfs_refcount_btree.c
> @@ -79,6 +79,8 @@ xfs_refcountbt_alloc_block(
>  	struct xfs_alloc_arg	args;		/* block allocation args */
>  	int			error;		/* error return value */
>  
> +	XFS_BTREE_TRACE_CURSOR(cur, XBT_ENTRY);
> +
>  	memset(&args, 0, sizeof(args));
>  	args.tp = cur->bc_tp;
>  	args.mp = cur->bc_mp;
> @@ -88,6 +90,7 @@ xfs_refcountbt_alloc_block(
>  	args.firstblock = args.fsbno;
>  	xfs_rmap_ag_owner(&args.oinfo, XFS_RMAP_OWN_REFC);
>  	args.minlen = args.maxlen = args.prod = 1;
> +	args.resv = XFS_AG_RESV_METADATA;
>  
>  	error = xfs_alloc_vextent(&args);
>  	if (error)
> @@ -125,16 +128,19 @@ xfs_refcountbt_free_block(
>  	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
>  	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
>  	struct xfs_owner_info	oinfo;
> +	int			error;
>  
>  	trace_xfs_refcountbt_free_block(cur->bc_mp, cur->bc_private.a.agno,
>  			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1);
>  	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_REFC);
>  	be32_add_cpu(&agf->agf_refcount_blocks, -1);
>  	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
> -	xfs_bmap_add_free(mp, cur->bc_private.a.dfops, fsbno, 1,
> -			&oinfo);
> +	error = xfs_free_extent(cur->bc_tp, fsbno, 1, &oinfo,
> +			XFS_AG_RESV_METADATA);
> +	if (error)
> +		return error;
>  
> -	return 0;
> +	return error;
>  }
>  
>  STATIC int
> @@ -410,3 +416,36 @@ xfs_refcountbt_max_size(
>  
>  	return xfs_refcountbt_calc_size(mp, mp->m_sb.sb_agblocks);
>  }
> +
> +/*
> + * Figure out how many blocks to reserve and how many are used by this btree.
> + */
> +int
> +xfs_refcountbt_calc_reserves(
> +	struct xfs_mount	*mp,
> +	xfs_agnumber_t		agno,
> +	xfs_extlen_t		*ask,
> +	xfs_extlen_t		*used)
> +{
> +	struct xfs_buf		*agbp;
> +	struct xfs_agf		*agf;
> +	xfs_extlen_t		tree_len;
> +	int			error;
> +
> +	if (!xfs_sb_version_hasreflink(&mp->m_sb))
> +		return 0;
> +
> +	*ask += xfs_refcountbt_max_size(mp);
> +
> +	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
> +	if (error)
> +		return error;
> +
> +	agf = XFS_BUF_TO_AGF(agbp);
> +	tree_len = be32_to_cpu(agf->agf_refcount_blocks);
> +	xfs_buf_relse(agbp);
> +
> +	*used += tree_len;
> +
> +	return error;
> +}
> diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
> index 780b02f..3be7768 100644
> --- a/fs/xfs/libxfs/xfs_refcount_btree.h
> +++ b/fs/xfs/libxfs/xfs_refcount_btree.h
> @@ -68,4 +68,7 @@ extern xfs_extlen_t xfs_refcountbt_calc_size(struct xfs_mount *mp,
>  		unsigned long long len);
>  extern xfs_extlen_t xfs_refcountbt_max_size(struct xfs_mount *mp);
>  
> +extern int xfs_refcountbt_calc_reserves(struct xfs_mount *mp,
> +		xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
> +
>  #endif	/* __XFS_REFCOUNT_BTREE_H__ */
> diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> index 9c0585e..83e672f 100644
> --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> @@ -35,6 +35,7 @@
>  #include "xfs_cksum.h"
>  #include "xfs_error.h"
>  #include "xfs_extent_busy.h"
> +#include "xfs_ag_resv.h"
>  
>  /*
>   * Reverse map btree.
> @@ -533,3 +534,62 @@ xfs_rmapbt_compute_maxlevels(
>  		mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(mp,
>  				mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
>  }
> +
> +/* Calculate the rmap btree size for some records. */
> +xfs_extlen_t
> +xfs_rmapbt_calc_size(
> +	struct xfs_mount	*mp,
> +	unsigned long long	len)
> +{
> +	return xfs_btree_calc_size(mp, mp->m_rmap_mnr, len);
> +}
> +
> +/*
> + * Calculate the maximum rmap btree size.
> + */
> +xfs_extlen_t
> +xfs_rmapbt_max_size(
> +	struct xfs_mount	*mp)
> +{
> +	/* Bail out if we're uninitialized, which can happen in mkfs. */
> +	if (mp->m_rmap_mxr[0] == 0)
> +		return 0;
> +
> +	return xfs_rmapbt_calc_size(mp, mp->m_sb.sb_agblocks);
> +}
> +
> +/*
> + * Figure out how many blocks to reserve and how many are used by this btree.
> + */
> +int
> +xfs_rmapbt_calc_reserves(
> +	struct xfs_mount	*mp,
> +	xfs_agnumber_t		agno,
> +	xfs_extlen_t		*ask,
> +	xfs_extlen_t		*used)
> +{
> +	struct xfs_buf		*agbp;
> +	struct xfs_agf		*agf;
> +	xfs_extlen_t		pool_len;
> +	xfs_extlen_t		tree_len;
> +	int			error;
> +
> +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> +		return 0;
> +
> +	/* Reserve 1% of the AG or enough for 1 block per record. */
> +	pool_len = max(mp->m_sb.sb_agblocks / 100, xfs_rmapbt_max_size(mp));
> +	*ask += pool_len;
> +
> +	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
> +	if (error)
> +		return error;
> +
> +	agf = XFS_BUF_TO_AGF(agbp);
> +	tree_len = be32_to_cpu(agf->agf_rmap_blocks);
> +	xfs_buf_relse(agbp);
> +
> +	*used += tree_len;
> +
> +	return error;
> +}
> diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> index e73a553..2a9ac47 100644
> --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> @@ -58,4 +58,11 @@ struct xfs_btree_cur *xfs_rmapbt_init_cursor(struct xfs_mount *mp,
>  int xfs_rmapbt_maxrecs(struct xfs_mount *mp, int blocklen, int leaf);
>  extern void xfs_rmapbt_compute_maxlevels(struct xfs_mount *mp);
>  
> +extern xfs_extlen_t xfs_rmapbt_calc_size(struct xfs_mount *mp,
> +		unsigned long long len);
> +extern xfs_extlen_t xfs_rmapbt_max_size(struct xfs_mount *mp);
> +
> +extern int xfs_rmapbt_calc_reserves(struct xfs_mount *mp,
> +		xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
> +
>  #endif	/* __XFS_RMAP_BTREE_H__ */
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 3acbf4e0..93d12fa 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -43,6 +43,7 @@
>  #include "xfs_log.h"
>  #include "xfs_filestream.h"
>  #include "xfs_rmap.h"
> +#include "xfs_ag_resv.h"
>  
>  /*
>   * File system operations
> @@ -630,6 +631,11 @@ xfs_growfs_data_private(
>  	xfs_set_low_space_thresholds(mp);
>  	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
>  
> +	/* Reserve AG metadata blocks. */
> +	error = xfs_fs_reserve_ag_blocks(mp);
> +	if (error && error != -ENOSPC)
> +		goto out;
> +
>  	/* update secondary superblocks. */
>  	for (agno = 1; agno < nagcount; agno++) {
>  		error = 0;
> @@ -680,6 +686,8 @@ xfs_growfs_data_private(
>  			continue;
>  		}
>  	}
> +
> + out:
>  	return saved_error ? saved_error : error;
>  
>   error0:
> @@ -989,3 +997,59 @@ xfs_do_force_shutdown(
>  	"Please umount the filesystem and rectify the problem(s)");
>  	}
>  }
> +
> +/*
> + * Reserve free space for per-AG metadata.
> + */
> +int
> +xfs_fs_reserve_ag_blocks(
> +	struct xfs_mount	*mp)
> +{
> +	xfs_agnumber_t		agno;
> +	struct xfs_perag	*pag;
> +	int			error = 0;
> +	int			err2;
> +
> +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> +		pag = xfs_perag_get(mp, agno);
> +		err2 = xfs_ag_resv_init(pag);
> +		xfs_perag_put(pag);
> +		if (err2 && !error)
> +			error = err2;
> +	}
> +
> +	if (error && error != -ENOSPC) {
> +		xfs_warn(mp,
> +	"Error %d reserving per-AG metadata reserve pool.", error);
> +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> +	}
> +
> +	return error;
> +}
> +
> +/*
> + * Free space reserved for per-AG metadata.
> + */
> +int
> +xfs_fs_unreserve_ag_blocks(
> +	struct xfs_mount	*mp)
> +{
> +	xfs_agnumber_t		agno;
> +	struct xfs_perag	*pag;
> +	int			error = 0;
> +	int			err2;
> +
> +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> +		pag = xfs_perag_get(mp, agno);
> +		err2 = xfs_ag_resv_free(pag);
> +		xfs_perag_put(pag);
> +		if (err2 && !error)
> +			error = err2;
> +	}
> +
> +	if (error)
> +		xfs_warn(mp,
> +	"Error %d freeing per-AG metadata reserve pool.", error);
> +
> +	return error;
> +}
> diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
> index f32713f..f349158 100644
> --- a/fs/xfs/xfs_fsops.h
> +++ b/fs/xfs/xfs_fsops.h
> @@ -26,4 +26,7 @@ extern int xfs_reserve_blocks(xfs_mount_t *mp, __uint64_t *inval,
>  				xfs_fsop_resblks_t *outval);
>  extern int xfs_fs_goingdown(xfs_mount_t *mp, __uint32_t inflags);
>  
> +extern int xfs_fs_reserve_ag_blocks(struct xfs_mount *mp);
> +extern int xfs_fs_unreserve_ag_blocks(struct xfs_mount *mp);
> +
>  #endif	/* __XFS_FSOPS_H__ */
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index caecbd2..b5da81d 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -986,10 +986,17 @@ xfs_mountfs(
>  			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
>  			goto out_quota;
>  		}
> +
> +		/* Reserve AG blocks for future btree expansion. */
> +		error = xfs_fs_reserve_ag_blocks(mp);
> +		if (error && error != -ENOSPC)
> +			goto out_agresv;
>  	}
>  
>  	return 0;
>  
> + out_agresv:
> +	xfs_fs_unreserve_ag_blocks(mp);
>   out_quota:
>  	xfs_qm_unmount_quotas(mp);
>   out_rtunmount:
> @@ -1034,6 +1041,7 @@ xfs_unmountfs(
>  
>  	cancel_delayed_work_sync(&mp->m_eofblocks_work);
>  
> +	xfs_fs_unreserve_ag_blocks(mp);
>  	xfs_qm_unmount_quotas(mp);
>  	xfs_rtunmount_inodes(mp);
>  	IRELE(mp->m_rootip);
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index e6aaa91..875ab9f 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1315,10 +1315,22 @@ xfs_fs_remount(
>  			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
>  			return error;
>  		}
> +
> +		/* Create the per-AG metadata reservation pool. */
> +		error = xfs_fs_reserve_ag_blocks(mp);
> +		if (error && error != -ENOSPC)
> +			return error;
>  	}
>  
>  	/* rw -> ro */
>  	if (!(mp->m_flags & XFS_MOUNT_RDONLY) && (*flags & MS_RDONLY)) {
> +		/* Free the per-AG metadata reservation pool. */
> +		error = xfs_fs_unreserve_ag_blocks(mp);
> +		if (error) {
> +			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> +			return error;
> +		}
> +
>  		/*
>  		 * Before we sync the metadata, we need to free up the reserve
>  		 * block pool so that the used block count in the superblock on
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html