On Wed, Sep 30, 2020 at 09:30:30AM -0400, Brian Foster wrote:
> On Wed, Sep 30, 2020 at 06:48:08AM +1000, Dave Chinner wrote:
> > On Tue, Sep 29, 2020 at 10:12:28AM -0400, Brian Foster wrote:
> > > The quotaoff operation logs two log items. The start item is
> > > committed first, followed by the bulk of the in-core quotaoff
> > > processing, and then the quotaoff end item is committed to release
> > > the start item from the log. The problem with this mechanism is that
> > > quite a bit of processing can be required to release dquots from all
> > > in-core inodes and subsequently flush/purge all dquots in the
> > > system. This processing work doesn't generally generate much log
> > > traffic itself, but the start item pins the tail of the log. If an
> > > external workload consumes the remaining log space before the
> > > transaction for the end item is allocated, a log deadlock can occur.
> > >
> > > The purpose of the separate start and end log items is primarily to
> > > ensure that log recovery does not incorrectly recover dquot data
> > > after an fs crash where a quotaoff was in progress. If we only
> > > logged a single quotaoff item, for example, it could fall behind the
> > > tail of the log before the last dquot modification was made and
> > > incorrectly replay dquot changes that might have occurred after the
> > > start item committed but before quotaoff deactivated the quota.
> > >
> > > With that context, we can make some small changes to the quotaoff
> > > algorithm to provide the same general log ordering guarantee without
> > > such a large window to create a log deadlock vector. Rather than
> > > place a start item in the log for the duration of quotaoff
> > > processing, we can quiesce the transaction subsystem up front to
> > > guarantee that no further dquots are logged from that point forward.
> > > IOW, we freeze the transaction subsystem, commit the start item in a
> > > synchronous transaction that forces the log, deactivate the
> > > associated quota such that subsequent transactions no longer modify
> > > associated dquots, unfreeze the transaction subsystem and finally
> > > commit the quotaoff end item. The transaction freeze is somewhat of
> > > a heavy weight operation, but quotaoff is already a rare, slow and
> > > performance disruptive operation.
> > >
> > > Altogether, this means that the dquot rele/purge sequence occurs
> > > after the quotaoff end item has committed and thus can technically
> > > fall off the active range of the log. This is safe because the
> > > remaining processing is all in-core work that doesn't involve
> > > logging dquots and we've guaranteed that no further dquots can be
> > > modified by external transactions. This allows quotaoff to complete
> > > without risking log deadlock regardless of how much dquot processing
> > > is required.
> > >
> > > Suggested-by: Dave Chinner <david@xxxxxxxxxxxxx>
> > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > > ---
> > >  fs/xfs/xfs_qm_syscalls.c | 36 ++++++++++++++++--------------------
> > >  fs/xfs/xfs_trans_dquot.c |  2 ++
> > >  2 files changed, 18 insertions(+), 20 deletions(-)
> > >
> > > diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> > > index ca1b57d291dc..97844f33f70f 100644
> > > --- a/fs/xfs/xfs_qm_syscalls.c
> > > +++ b/fs/xfs/xfs_qm_syscalls.c
> > > @@ -29,7 +29,8 @@ xfs_qm_log_quotaoff(
> > >          int                     error;
> > >          struct xfs_qoff_logitem *qoffi;
> > >
> > > -        error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
> > > +        error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
> > > +                        XFS_TRANS_NO_WRITECOUNT, &tp);
> > >          if (error)
> > >                  goto out;
> > >
> > > @@ -169,14 +170,18 @@ xfs_qm_scall_quotaoff(
> > >          if ((mp->m_qflags & flags) == 0)
> > >                  goto out_unlock;
> > >
> > > +        xfs_trans_freeze(mp);
> > > +
> > >          /*
> > >           * Write the LI_QUOTAOFF log record, and do SB changes atomically,
> > >           * and synchronously. If we fail to write, we should abort the
> > >           * operation as it cannot be recovered safely if we crash.
> > >           */
> > >          error = xfs_qm_log_quotaoff(mp, &qoffstart, flags);
> > > -        if (error)
> > > +        if (error) {
> > > +                xfs_trans_unfreeze(mp);
> > >                  goto out_unlock;
> > > +        }
> > >
> > >          /*
> > >           * Next we clear the XFS_MOUNT_*DQ_ACTIVE bit(s) in the mount struct
> > > @@ -191,6 +196,15 @@ xfs_qm_scall_quotaoff(
> > >           */
> > >          mp->m_qflags &= ~inactivate_flags;
> > >
> > > +        xfs_trans_unfreeze(mp);
> > > +
> > > +        error = xfs_qm_log_quotaoff_end(mp, &qoffstart, flags);
> > > +        if (error) {
> > > +                /* We're screwed now. Shutdown is the only option. */
> > > +                xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > > +                goto out_unlock;
> > > +        }
> > > +
> > The m_qflags update is racy. There are no memory barriers or locks
> > here to order the mp->m_qflags update with other CPUs, so there's no
> > guarantee that an incoming transaction will see the m_qflags change
> > that disables quotas.
> >
>
> The transaction freeze is based on a percpu rwsem (I've replaced the
> abuse of the freeze lock with our own lock in patch 2). That documents a
> memory barrier on the read side, but that appears to be down in
> __percpu_down_read_trylock() (or percpu_rwsem_wait()). What isn't clear
> to me is whether the higher level RCU fast path has similar semantics,
> particularly in the case where a racing transaction allocates just after
> the transaction unfreeze...
>
> Hmm... it seems not, but the rcu sync state managed by the lock appears
> to be reentrant. I'm wondering if we could do something like grab the
> write lock in the quotaoff path, re-enter the rcu sync state explicitly,
> do the quotaoff bits, unlock, then hold the extra rcu sync entry until
> the dqrele scan completes. That means transactions would temporarily hit
> the memory barrier via the read lock until we can fall back to the
> current ilock ordering scheme. OTOH, I don't see any precedent for
> something like that elsewhere in the kernel, though cgroups does
> something very similar by forcing the slow path unconditionally via
> rcu_sync_enter_start(). This would presumably just do that
> dynamically/temporarily. Thoughts?
>

After digging further into it, this doesn't appear to be a viable
option because the associated rcu_sync_*() symbols aren't exported..

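(For anyone following along, here is a rough sketch of the sort of percpu
rwsem based freeze referred to above. This is illustrative only, not the
actual patch 2 code; xfs_trans_freeze()/xfs_trans_unfreeze() are the names
used in the diff above, but the m_trans_rwsem field and the read side
helper names below are made up.)

        #include <linux/percpu-rwsem.h>

        /*
         * Illustrative sketch only. Assumes a percpu_rw_semaphore embedded
         * in the xfs_mount; the m_trans_rwsem field name is made up.
         */
        void
        xfs_trans_freeze(
                struct xfs_mount        *mp)
        {
                /* write side: blocks new readers, waits for in-flight ones */
                percpu_down_write(&mp->m_trans_rwsem);
        }

        void
        xfs_trans_unfreeze(
                struct xfs_mount        *mp)
        {
                percpu_up_write(&mp->m_trans_rwsem);
        }

        /*
         * Read side, held for the life of each transaction (e.g. taken at
         * transaction allocation time and dropped at commit/cancel). Per
         * the discussion above, the documented read side barrier lives
         * down in __percpu_down_read_trylock()/percpu_rwsem_wait(), and
         * the higher level RCU fast path does not appear to provide the
         * same ordering. Helper names here are made up.
         */
        void
        xfs_trans_take_ref(
                struct xfs_mount        *mp)
        {
                percpu_down_read(&mp->m_trans_rwsem);
        }

        void
        xfs_trans_drop_ref(
                struct xfs_mount        *mp)
        {
                percpu_up_read(&mp->m_trans_rwsem);
        }

The point being that quotaoff holding the write side across the m_qflags
update is what forces a racing transaction allocation through the read
side (and whatever ordering it does or doesn't provide), which is the
crux of the question above.
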
Brian

> For reference, the current quotaoff implementation appears to rely on
> the order of the flag update before the dqrele scan as it assumes racing
> transactions check quota active state under the ilock:
>
>         /*
>          * Next we clear the XFS_MOUNT_*DQ_ACTIVE bit(s) in the mount struct
>          * to take care of the race between dqget and quotaoff. We don't take
>          * any special locks to reset these bits. All processes need to check
>          * these bits *after* taking inode lock(s) to see if the particular
>          * quota type is in the process of being turned off. If *ACTIVE, it is
>          * guaranteed that all dquot structures and all quotainode ptrs will all
>          * stay valid as long as that inode is kept locked.
>          *
>          * There is no turning back after this.
>          */
>         mp->m_qflags &= ~inactivate_flags;
>
> IOW, since quotaoff scans and cycles every ilock, the dqrele scan
> essentially serves as the barrier for racing transactions checking quota
> active state (under ilock).
>
> > Also, we can now race with other transaction reservations to log the
> > quota off end item, which means that if there are enough pending
> > transaction reservations queued before the quotaoff_end transaction
> > is started, the quotaoff item can pin the tail of the log and we
> > deadlock.
> >
>
> That thought crossed my mind, but I didn't think it was possible in
> practice. It's easy enough to commit the end transaction with
> transactions paused as well so I'll make that change.
>
> > That was the reason why I logged both quota off items in the same
> > transaction in the prototype code I sent out - we have to log both
> > quota-off items while the transaction subsystem is quiesced
> > otherwise we don't fix the original problem (that the quotaoff
> > item can pin the tail of the log and deadlock).....
> >
>
> The prototype code logged both items in separate transactions within the
> quota transaction quiesce window, not both items in the same transaction.
> IIRC, we discussed eventually doing that along with some other potential
> cleanups, but I'm not looking to do any of that before the fundamental
> algorithm rework settles...
>
> Brian
>
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@xxxxxxxxxxxxx
> >
>
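P.S. Purely for illustration, the ilock ordering that quoted comment relies
on boils down to something like the following on the racing transaction
side. This is a made-up fragment, not code lifted from the tree; only
XFS_IS_UQUOTA_ON() and the xfs_ilock()/xfs_iunlock() calls are real
interfaces, the function name is invented:

        /*
         * Illustrative only: the shape of the check a racing transaction
         * makes under the ilock before touching user quota state.
         */
        static void
        example_attach_udquot(
                struct xfs_inode        *ip)
        {
                struct xfs_mount        *mp = ip->i_mount;

                xfs_ilock(ip, XFS_ILOCK_EXCL);
                if (XFS_IS_UQUOTA_ON(mp)) {
                        /*
                         * User quota is still ACTIVE. While this ilock is
                         * held, the dquot structures and quotainode
                         * pointers stay valid, because quotaoff's dqrele
                         * scan has to cycle this same ilock before
                         * anything is torn down.
                         */
                        /* ... attach and/or log the user dquot here ... */
                }
                xfs_iunlock(ip, XFS_ILOCK_EXCL);
        }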