Re: [PATCH 3/3] xfs: rework quotaoff logging to avoid log deadlock on active fs

Brian Foster <bfoster@xxxxxxxxxx> · Thu, 8 Oct 2020 07:29:31 -0400

On Wed, Oct 07, 2020 at 03:24:16PM -0700, Darrick J. Wong wrote:
> On Thu, Oct 01, 2020 at 11:03:10AM -0400, Brian Foster wrote:
> > The quotaoff operation logs two log items. The start item is
> > committed first, followed by the bulk of the in-core quotaoff
> > processing, and then the quotaoff end item is committed to release
> > the start item from the log. The problem with this mechanism is that
> > quite a bit of processing can be required to release dquots from all
> > in-core inodes and subsequently flush/purge all dquots in the
> > system. This processing work doesn't generally generate much log
> > traffic itself, but the start item pins the tail of the log. If an
> > external workload consumes the remaining log space before the
> > transaction for the end item is allocated, a log deadlock can occur.
> > 
> > The purpose of the separate start and end log items is primarily to
> > ensure that log recovery does not incorrectly recover dquot data
> > after an fs crash where a quotaoff was in progress. If we only
> > logged a single quotaoff item, for example, it could fall behind the
> > tail of the log before the last dquot modification was made and
> > incorrectly replay dquot changes that might have occurred after the
> > start item committed but before quotaoff deactivated the quota.
> > 
> > With that context, we can make some small changes to the quotaoff
> > algorithm to provide the same general log ordering guarantee without
> > such a large window to create a log deadlock vector. Rather than
> > place a start item in the log for the duration of quotaoff
> > processing, we can quiesce the transaction subsystem up front to
> > guarantee that no further dquots are logged from that point forward.
> > IOW, we pause the transaction subsystem, commit the quotaoff start
> > and end items, deactivate the associated quota such that subsequent
> > transactions no longer modify associated dquots, and resume the
> > transaction subsystem. The transaction pause is somewhat of a heavy
> > weight operation, but quotaoff is already a rare, slow and
> > performance disruptive operation and the quiesce is only required
> > for two small transactions.
> > 
> > Altogether, this means that the dquot rele/purge sequence occurs
> > after the quotaoff end item has committed and thus can technically
> > fall off the end of the log. This is safe because the remaining
> > processing is in-core work that doesn't involve logging dquots and
> > we've guaranteed that no further dquots are modified by external
> > transactions. This allows quotaoff to complete without risking log
> > deadlock regardless of how much dquot processing is required.
> > 
> > Suggested-by: Dave Chinner <david@xxxxxxxxxxxxx>
> > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > ---
> >  fs/xfs/xfs_qm_syscalls.c | 133 +++++++++++++++++++--------------------
> >  fs/xfs/xfs_trans_dquot.c |   2 +
> >  2 files changed, 67 insertions(+), 68 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> > index ca1b57d291dc..b8e55f4947bd 100644
> > --- a/fs/xfs/xfs_qm_syscalls.c
> > +++ b/fs/xfs/xfs_qm_syscalls.c
...
> > @@ -163,89 +162,87 @@ xfs_qm_scall_quotaoff(
> >  	}
> >  
> >  	/*
> > -	 * Nothing to do?  Don't complain. This happens when we're just
> > -	 * turning off quota enforcement.
> > +	 * Nothing to do? Don't complain. This happens when we're just turning
> > +	 * off quota enforcement.
> >  	 */
> >  	if ((mp->m_qflags & flags) == 0)
> >  		goto out_unlock;
> >  
> >  	/*
> > -	 * Write the LI_QUOTAOFF log record, and do SB changes atomically,
> > -	 * and synchronously. If we fail to write, we should abort the
> > -	 * operation as it cannot be recovered safely if we crash.
> > +	 * Quotaoff must deactivate the associated quota mode(s), release dquots
> > +	 * from inodes and purge them from the system all while the filesystem
> > +	 * remains active. We have two quotaoff log records that traditionally
> > +	 * bound the start and end of this sequence. This guarantees that no
> > +	 * dquots are modified after the end item hits the log, but quotaoff can
> > +	 * be time consuming and thus prone to deadlock because the start item
> > +	 * pins the tail the of log in the meantime (and we can't hold the end
> > +	 * transaction open across the dqrele scan).
> > +	 *
> > +	 * The critical aspect of correctly logging quotaoff is that no dquots
> > +	 * are modified after the quotaoff end item hits the on-disk log.
> > +	 * Otherwise the quotaoff can fall off the tail and log recovery can
> > +	 * replay incorrect data. Instead of letting the start item sit in the
> > +	 * log while quotaoff completes, we can provide the same guarantee via a
> > +	 * runtime barrier for dquot modifications. Specifically, we pause all
> > +	 * transactions on the system via the transaction subsystem lock, log
> > +	 * both start and end items (via sync transactions, which drains the
> > +	 * CIL), deactivate the quota, and then resume the transaction subsystem
> > +	 * while quotaoff completes.
> > +	 *
> > +	 * This is safe because the remaining quotaoff work is in-core cleanup
> > +	 * and all subsequent transactions should see the updated quota state
> > +	 * due to memory ordering provided by the lock. We also avoid deadlock
> > +	 * by committing both items sequentially with near exclusive access to
> > +	 * the transaction subsystem.
> >  	 */
> > +	percpu_down_write(&mp->m_trans_rwsem);
> > +
> >  	error = xfs_qm_log_quotaoff(mp, &qoffstart, flags);
> > -	if (error)
> > +	if (error) {
> > +		percpu_up_write(&mp->m_trans_rwsem);
> >  		goto out_unlock;
> > +	}
> >  
> > -	/*
> > -	 * Next we clear the XFS_MOUNT_*DQ_ACTIVE bit(s) in the mount struct
> > -	 * to take care of the race between dqget and quotaoff. We don't take
> > -	 * any special locks to reset these bits. All processes need to check
> > -	 * these bits *after* taking inode lock(s) to see if the particular
> > -	 * quota type is in the process of being turned off. If *ACTIVE, it is
> > -	 * guaranteed that all dquot structures and all quotainode ptrs will all
> > -	 * stay valid as long as that inode is kept locked.
> > -	 *
> > -	 * There is no turning back after this.
> > -	 */
> >  	mp->m_qflags &= ~inactivate_flags;
> >  
> > +	error = xfs_qm_log_quotaoff_end(mp, &qoffstart, flags);
> > +	if (error) {
> > +		percpu_up_write(&mp->m_trans_rwsem);
> > +		/* We're screwed now. Shutdown is the only option. */
> > +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > +		goto out_unlock;
> > +	}
> > +
> > +	percpu_up_write(&mp->m_trans_rwsem);
> 
> Hmm, so if I read this correctly, you're changing what gets written to
> the log from:
> 
> 	<quotaoff intent>
> 	<transactions that maybe log dquots but we don't care>
> 	...now the qflags get cleared...
> 	<transactions that maybe log dquots but we don't care>
> 	...run around purging dquots...
> 	<transactions that maybe log dquots but we don't care>
> 	<quotaoff done>
> 
> to something that looks more like:
> 
> 	<transactions that log dquots>
> 	<quotaoff intent, all other transactions locked out>
> 	...now the qflags get cleared; no other transactions...
> 	<quotaoff done, turn off the lockout>
> 	<transactions that wont log dquots>
> 	...run around purging dquots...
> 

Yep, pretty much. The idea is to preserve the behavior that no dquots
are logged after the quotaoff end item to guarantee the log can never
hold a modified dquot after a previous quotaoff falls off the tail.
Rather than doing that by bounding the quotaoff operation with the two
log items, we introduce a runtime quiesce that deterministically
disables dquot activity and log the items after that point.

> Is that right?  I guess that makes sense, though it's sort of a pity
> that we now make every transaction grab a read lock.  Though I guess
> there's not much that can be done about that; it's better than pinning
> the log tail.  I don't even think there's a good way to relog that
> quotaoff-intent item, is there?  Since you'd have to, I dunno, do all
> the checks that xfs_defer_relog() does to decide if it should relog an
> intent item?
> 

Yeah, well that was a purpose of the previous automatic relog bits.
Short of that, I'm not aware of a great way to keep the start intent
moving. We can't hold a transaction open across the scan(s), so that
also rules out turning quotaoff into a dfop, for example. We could try
to relog the intent at random points, but that looks to me like a bunch
of busy work to break up the scans for no guarantee in return that the
relog transaction wouldn't still deadlock anyways.

Brian

> --D
> 
> > +
> >  	/*
> > -	 * Give back all the dquot reference(s) held by inodes.
> > -	 * Here we go thru every single incore inode in this file system, and
> > -	 * do a dqrele on the i_udquot/i_gdquot that it may have.
> > -	 * Essentially, as long as somebody has an inode locked, this guarantees
> > -	 * that quotas will not be turned off. This is handy because in a
> > -	 * transaction once we lock the inode(s) and check for quotaon, we can
> > -	 * depend on the quota inodes (and other things) being valid as long as
> > -	 * we keep the lock(s).
> > +	 * Release dquot references held by inodes. Technically some contexts
> > +	 * might not pick up the quota state change until the inode lock is
> > +	 * cycled if there is no transaction. We don't care about that above
> > +	 * because a dquot can't be logged without a transaction and we can't
> > +	 * release/purge a dquot here until we've cycled the locks of all inodes
> > +	 * that reference it.
> >  	 */
> >  	xfs_qm_dqrele_all_inodes(mp, flags);
> >  
> >  	/*
> >  	 * Next we make the changes in the quota flag in the mount struct.
> > -	 * This isn't protected by a particular lock directly, because we
> > -	 * don't want to take a mrlock every time we depend on quotas being on.
> > +	 * This isn't protected by a particular lock directly, because we don't
> > +	 * want to take a mrlock every time we depend on quotas being on.
> >  	 */
> >  	mp->m_qflags &= ~flags;
> >  
> > -	/*
> > -	 * Go through all the dquots of this file system and purge them,
> > -	 * according to what was turned off.
> > -	 */
> > +	/* purge all deactivated dquots from the filesystem */
> >  	xfs_qm_dqpurge_all(mp, dqtype);
> >  
> > -	/*
> > -	 * Transactions that had started before ACTIVE state bit was cleared
> > -	 * could have logged many dquots, so they'd have higher LSNs than
> > -	 * the first QUOTAOFF log record does. If we happen to crash when
> > -	 * the tail of the log has gone past the QUOTAOFF record, but
> > -	 * before the last dquot modification, those dquots __will__
> > -	 * recover, and that's not good.
> > -	 *
> > -	 * So, we have QUOTAOFF start and end logitems; the start
> > -	 * logitem won't get overwritten until the end logitem appears...
> > -	 */
> > -	error = xfs_qm_log_quotaoff_end(mp, &qoffstart, flags);
> > -	if (error) {
> > -		/* We're screwed now. Shutdown is the only option. */
> > -		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > -		goto out_unlock;
> > -	}
> > -
> > -	/*
> > -	 * If all quotas are completely turned off, close shop.
> > -	 */
> > +	/* if all quotas are completely turned off, close shop */
> >  	if (mp->m_qflags == 0) {
> >  		mutex_unlock(&q->qi_quotaofflock);
> >  		xfs_qm_destroy_quotainfo(mp);
> >  		return 0;
> >  	}
> >  
> > -	/*
> > -	 * Release our quotainode references if we don't need them anymore.
> > -	 */
> > +	/* release our quotainode references if we don't need them anymore */
> >  	if ((dqtype & XFS_QMOPT_UQUOTA) && q->qi_uquotaip) {
> >  		xfs_irele(q->qi_uquotaip);
> >  		q->qi_uquotaip = NULL;
> > diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
> > index 547ba824542e..9839b83e732a 100644
> > --- a/fs/xfs/xfs_trans_dquot.c
> > +++ b/fs/xfs/xfs_trans_dquot.c
> > @@ -52,6 +52,8 @@ xfs_trans_log_dquot(
> >  	struct xfs_dquot	*dqp)
> >  {
> >  	ASSERT(XFS_DQ_IS_LOCKED(dqp));
> > +	/* quotaoff expects no dquots logged after deactivation */
> > +	ASSERT(xfs_this_quota_on(tp->t_mountp, xfs_dquot_type(dqp)));
> >  
> >  	/* Upgrade the dquot to bigtime format if possible. */
> >  	if (dqp->q_id != 0 &&
> > -- 
> > 2.25.4
> > 
>