Re: [PATCH 1/2] xfs: reserve quota for dir expansion when linking/unlinking files

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Wed, 9 Mar 2022 15:33:02 -0800

On Thu, Mar 10, 2022 at 08:48:21AM +1100, Dave Chinner wrote:
> On Wed, Mar 09, 2022 at 11:22:26AM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > 
> > XFS does not reserve quota for directory expansion when linking or
> > unlinking children from a directory.  This means that we don't reject
> > the expansion with EDQUOT when we're at or near a hard limit, which
> > means that unprivileged userspace can use link()/unlink() to exceed
> > quota.
> > 
> > The fix for this is nuanced -- link operations don't always expand the
> > directory, and we allow a link to proceed with no space reservation if
> > we don't need to add a block to the directory to handle the addition.
> > Unlink operations generally do not expand the directory (you'd have to
> > free a block and then cause a btree split) and we can defer the
> > directory block freeing if there is no space reservation.
> > 
> > Moreover, there is a further bug in that we do not trigger the blockgc
> > workers to try to clear space when we're out of quota.
> > 
> > To fix both cases, create a new xfs_trans_alloc_dir function that
> > allocates the transaction, locks and joins the inodes, and reserves
> > quota for the directory.  If there isn't sufficient space or quota,
> > we'll switch the caller to reservationless mode.  This should prevent
> > quota usage overruns with the least restriction in functionality.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > ---
> >  fs/xfs/xfs_inode.c |   30 +++++-------------
> >  fs/xfs/xfs_trans.c |   86 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_trans.h |    3 ++
> >  3 files changed, 97 insertions(+), 22 deletions(-)
> 
> Overall looks good, minor nits below:
> 
> > 
> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 04bf467b1090..a131bbfe74e4 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -1217,7 +1217,7 @@ xfs_link(
> >  {
> >  	xfs_mount_t		*mp = tdp->i_mount;
> >  	xfs_trans_t		*tp;
> > -	int			error;
> > +	int			error, space_error;
> >  	int			resblks;
> >  
> >  	trace_xfs_link(tdp, target_name);
> > @@ -1236,19 +1236,11 @@ xfs_link(
> >  		goto std_return;
> >  
> >  	resblks = XFS_LINK_SPACE_RES(mp, target_name->len);
> > -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_link, resblks, 0, 0, &tp);
> > -	if (error == -ENOSPC) {
> > -		resblks = 0;
> > -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_link, 0, 0, 0, &tp);
> > -	}
> > +	error = xfs_trans_alloc_dir(tdp, &M_RES(mp)->tr_link, sip, &resblks,
> > +			&tp, &space_error);
> 
> It's the nospace_error, isn't it? The code reads a lot better when
> it's called that, too.

Fixed.

> 
> >  	if (error)
> >  		goto std_return;
> >  
> > -	xfs_lock_two_inodes(sip, XFS_ILOCK_EXCL, tdp, XFS_ILOCK_EXCL);
> > -
> > -	xfs_trans_ijoin(tp, sip, XFS_ILOCK_EXCL);
> > -	xfs_trans_ijoin(tp, tdp, XFS_ILOCK_EXCL);
> > -
> >  	error = xfs_iext_count_may_overflow(tdp, XFS_DATA_FORK,
> >  			XFS_IEXT_DIR_MANIP_CNT(mp));
> >  	if (error)
> > @@ -1267,6 +1259,8 @@ xfs_link(
> >  
> >  	if (!resblks) {
> >  		error = xfs_dir_canenter(tp, tdp, target_name);
> > +		if (error == -ENOSPC && space_error)
> > +			error = space_error;
> 
> This  would be better in the error_return stack, I think. That way
> the transformation only has to be done once, and it will be done for
> all functions that can potentially return ENOSPC.

Ok.

> >  		if (error)
> >  			goto error_return;
> >  	}
> > @@ -2755,6 +2749,7 @@ xfs_remove(
> >  	xfs_mount_t		*mp = dp->i_mount;
> >  	xfs_trans_t             *tp = NULL;
> >  	int			is_dir = S_ISDIR(VFS_I(ip)->i_mode);
> > +	int			dontcare;
> >  	int                     error = 0;
> >  	uint			resblks;
> >  
> > @@ -2781,22 +2776,13 @@ xfs_remove(
> >  	 * block from the directory.
> >  	 */
> >  	resblks = XFS_REMOVE_SPACE_RES(mp);
> > -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_remove, resblks, 0, 0, &tp);
> > -	if (error == -ENOSPC) {
> > -		resblks = 0;
> > -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_remove, 0, 0, 0,
> > -				&tp);
> > -	}
> > +	error = xfs_trans_alloc_dir(dp, &M_RES(mp)->tr_remove, ip, &resblks,
> > +			&tp, &dontcare);
> >  	if (error) {
> >  		ASSERT(error != -ENOSPC);
> >  		goto std_return;
> >  	}
> 
> So we just ignore -EDQUOT when it is returned in @dontcare? I'd like
> a comment to explain why we don't care about EDQUOT here, because
> the next time I look at this I will have forgotten all about this...

Ok.  How about:

	/*
	 * We try to get the real space reservation first, allowing for
	 * directory btree deletion(s) implying possible bmap insert(s).
	 * If we can't get the space reservation then we use 0 instead,
	 * and avoid the bmap btree insert(s) in the directory code by,
	 * if the bmap insert tries to happen, instead trimming the LAST
	 * block from the directory.
	 *
	 * Ignore EDQUOT and ENOSPC being returned via nospace_error
	 * because the directory code can handle a reservationless
	 * update and we don't want to prevent a user from trying to
	 * free space by deleting things.
	 */
	error = xfs_trans_alloc_dir(...);

--D

> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx