Re: [PATCH] xfs: don't reserve blocks for right shift transactions

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Feb 15, 2017 at 03:01:54PM -0500, Brian Foster wrote:
> On Wed, Feb 15, 2017 at 11:36:37AM -0800, Darrick J. Wong wrote:
> > On Wed, Feb 15, 2017 at 02:15:30PM -0500, Brian Foster wrote:
> > > On Wed, Feb 15, 2017 at 10:09:55AM -0800, Darrick J. Wong wrote:
> > > > On Wed, Feb 15, 2017 at 10:05:28AM -0500, Brian Foster wrote:
> > > > > The block reservation for the transaction allocated in
> > > > > xfs_shift_file_space() is an artifact of the original collapse range
> > > > > support. It exists to handle the case where a collapse range occurs,
> > > > > the initial extent is left shifted into a location that forms a
> > > > > contiguous boundary with the previous extent and thus the extents
> > > > > are merged. This code was subsequently refactored and reused for
> > > > > insert range (right shift) support.
> > > > > 
> > > > > If an insert range occurs under low free space conditions, the
> > > > > extent at the starting offset is split before the first shift
> > > > > transaction is allocated. If the block reservation fails, this
> > > > > leaves separate, but contiguous extents around in the inode. While
> > > > > not a fatal problem, this is unexpected and will flag a warning on
> > > > > subsequent insert range operations on the inode. This problem has
> > > > > been reproduce intermittently by generic/270 running against a
> > > > > ramdisk device.
> > > > > 
> > > > > Since right shift does not create new extent boundaries in the
> > > > > inode, a block reservation for extent merge is unnecessary. Update
> > > > > xfs_shift_file_space() to conditionally reserve fs blocks for left
> > > > > shift transactions only. This avoids the warning reproduced by
> > > > > generic/270.
> > > > > 
> > > > > Reported-by: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
> > > > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > > > > ---
> > > > >  fs/xfs/xfs_bmap_util.c | 20 ++++++++++----------
> > > > >  1 file changed, 10 insertions(+), 10 deletions(-)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > > > > index 7c3bfaf..6be5f26 100644
> > > > > --- a/fs/xfs/xfs_bmap_util.c
> > > > > +++ b/fs/xfs/xfs_bmap_util.c
> > > > > @@ -1385,10 +1385,16 @@ xfs_shift_file_space(
> > > > >  	xfs_fileoff_t		stop_fsb;
> > > > >  	xfs_fileoff_t		next_fsb;
> > > > >  	xfs_fileoff_t		shift_fsb;
> > > > > +	uint			resblks;
> > > > >  
> > > > >  	ASSERT(direction == SHIFT_LEFT || direction == SHIFT_RIGHT);
> > > > >  
> > > > >  	if (direction == SHIFT_LEFT) {
> > > > > +		/*
> > > > > +		 * Reserve blocks to cover potential extent merges after left
> > > > > +		 * shift operations.
> > > > > +		 */
> > > > > +		resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
> > > > >  		next_fsb = XFS_B_TO_FSB(mp, offset + len);
> > > > >  		stop_fsb = XFS_B_TO_FSB(mp, VFS_I(ip)->i_size);
> > > > >  	} else {
> > > > > @@ -1396,6 +1402,7 @@ xfs_shift_file_space(
> > > > >  		 * If right shift, delegate the work of initialization of
> > > > >  		 * next_fsb to xfs_bmap_shift_extent as it has ilock held.
> > > > >  		 */
> > > > > +		resblks = 0;
> > > > 
> > > > Hmmmm.  I am convinced that this patch removes the most likely cause of
> > > > _trans_alloc failure, and therefore makes the g/270 failures go away.
> > > > 
> > > > However, I worry that if we split the extent and _trans_alloc fails for
> > > > some other reason (e.g. ENOMEM) then we'll still end up two adjacent
> > > > bmap extents that should be combined.  Granted, the only solution that I
> > > > can think of is very complicated (create a redo log item, link
> > > > everything together with the deferred ops mechanism, thereby making
> > > > right shift an atomic operation) for something that's unlikely to
> > > > happen(?) during an operation that might not be all that frequent
> > > > anyway.  I'm also not sure about the implications of adjacent mergeable
> > > > bmaps -- I think we can handle it, but it's not like I've researched
> > > > this thoroughly.
> > > > 
> > > > <shrug> Thoughts?
> > > > 
> > > 
> > > Yeah, this isn't a pure fix given the way the code is organized. I think
> > > I meant to point that out in the commit log; that technically this state
> > > is still possible, but probably not as likely to occur. I also
> > > considered killing off the warning, but it still seems useful to me for
> > > similar reasons. (Effectively, the motivation for this patch is really
> > > just to shut the test up. :).
> > 
> > :)
> > 
> > I tossed the patch on the 4.11 pile and we'll see if a quick round
> > of testing turns anything up.
> > 
> 
> Thanks.
> 
> > > I considered error handling just enough to realize that there wasn't a
> > > simple solution. Given that this change seemed correct regardless, I
> > > figured this works for now and we can revisit if this remains a problem
> > > in practice.
> > > 
> > > Beyond that... an atomic rewrite using the deferred ops stuff seems like
> > > a reasonable approach technically, but probably should be more motivated
> > > by the broader fact that afaict any of the collapse/insert range
> > > operations can fail midway through the overall operation and leave the
> > > file in a halfway shifted state with respect to the original request.
> > 
> > Yep.
> > 
> > > Would deferred ops address that problem (e.g., what if a subsequent
> > > transaction allocation failed in that model, after one or more extents
> > > had already been shifted)?
> > 
> > Yep, that's exactly what deferred ops do -- it queues up a large
> > operation, logs an intent for the whole operation, and re-logs
> > successively smaller intent items as we incrementally finish little
> > pieces of the big operation.  If something happens midway through, we go
> > offline so that later on journal recovery will know exactly where we
> > left off and can restart the operation.
> > 
> 
> So is part of the tradeoff for the atomicity of the deferred ops model
> that any failure midway through becomes a shutdown event as opposed to a
> potential userspace error return (notwithstanding cancellation of a
> dirty transaction, which is always a shutdown)?

Yes, since we've started updating metadata but haven't finished, which
implies that the fs is inconsistent.  It was designed this way to avoid
the situation where a bmbt update commits but its corresponding rmap
update fails in between transactions, so now the metadata are
inconsistent.

For extent shifting the justification for shutting down is more tenuous
because the fs isn't inconsistent; we simply have a file operation that
failed midway through and fallocate doesn't tell userspace which part
of the operation actually did succeed so that it can react accordingly.
If, say, we manage to move one or two extents in a sparse file then
it may be very hard to tell which extents did /not/ get moved.

> > (This is a little dangerous -- if large operation fails because of
> > corrupted metadata then recovery will never succeed, forcing the user to
> > zero the log and see what xfs_repair will do...)
> > 
> 
> Indeed.

Some day if xfs ever grows the ability to roll back transactions we
might be able to avoid some of this...

> 
> Brian
> 
> > If we ever /do/ redesign the code to use defer ops then we'll also have
> > to set a log incompat feature flag because right now we only ever use
> > the bmap defer ops if rmap or reflink are enabled.
> > 
> > > Then again, I _thought_ that all came up when the collapse range stuff
> > > was originally posted and wasn't considered a major problem to the users
> > > (either that or we didn't have a straightforward approach to make the
> > > whole thing atomic at the time) because ultimately the operation can be
> > > retried or the original state recovered from userspace...
> > 
> > Userspace can try to figure out where things went wrong and restart the
> > operation, but if they don't then we get left in this weird state with
> > adjacent bmaps.  There wasn't a straightforward way to create an atomic
> > compound operation until deferred ops came along in 4.8.
> > 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > > --D
> > > > 
> > > > >  		next_fsb = NULLFSBLOCK;
> > > > >  		stop_fsb = XFS_B_TO_FSB(mp, offset);
> > > > >  	}
> > > > > @@ -1437,21 +1444,14 @@ xfs_shift_file_space(
> > > > >  	}
> > > > >  
> > > > >  	while (!error && !done) {
> > > > > -		/*
> > > > > -		 * We would need to reserve permanent block for transaction.
> > > > > -		 * This will come into picture when after shifting extent into
> > > > > -		 * hole we found that adjacent extents can be merged which
> > > > > -		 * may lead to freeing of a block during record update.
> > > > > -		 */
> > > > > -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
> > > > > -				XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
> > > > > +		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0,
> > > > > +					&tp);
> > > > >  		if (error)
> > > > >  			break;
> > > > >  
> > > > >  		xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > >  		error = xfs_trans_reserve_quota(tp, mp, ip->i_udquot,
> > > > > -				ip->i_gdquot, ip->i_pdquot,
> > > > > -				XFS_DIOSTRAT_SPACE_RES(mp, 0), 0,
> > > > > +				ip->i_gdquot, ip->i_pdquot, resblks, 0,
> > > > >  				XFS_QMOPT_RES_REGBLKS);
> > > > >  		if (error)
> > > > >  			goto out_trans_cancel;
> > > > > -- 
> > > > > 2.7.4
> > > > > 
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux