On Tue, Feb 14, 2012 at 09:29:29PM -0500, Christoph Hellwig wrote:
> Instead of keeping a separate per-filesystem list of dquots we can walk
> the radix tree for the two places where we need to iterate all quota
> structures.

And with the new radix tree iterator code being worked on, this will
become even simpler soon...

.....

> @@ -1025,16 +1017,23 @@ xfs_dqlock2(
>
>  /*
>   * Take a dquot out of the mount's dqlist as well as the hashlist. This is

It is removed from the quota tree now. No hashlist anymore, either...

> - * called via unmount as well as quotaoff, and the purge will always succeed.
> + * called via unmount as well as quotaoff.
>   */
> -void
> +int
>  xfs_qm_dqpurge(
> -	struct xfs_dquot *dqp)
> +	struct xfs_dquot *dqp,
> +	int flags)
>  {
>  	struct xfs_mount *mp = dqp->q_mount;
>  	struct xfs_quotainfo *qi = mp->m_quotainfo;
>
>  	xfs_dqlock(dqp);
> +	if ((dqp->dq_flags & XFS_DQ_FREEING) || dqp->q_nrefs != 0) {
> +		xfs_dqlock(dqp);

xfs_dqunlock()?
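Presumably that second xfs_dqlock() is meant to be xfs_dqunlock() followed
by skipping the dquot, i.e. something like this (untested sketch only; the
EAGAIN return is a guess based on how the walk code further down treats
EAGAIN as a skipped dquot):

	xfs_dqlock(dqp);
	if ((dqp->dq_flags & XFS_DQ_FREEING) || dqp->q_nrefs != 0) {
		/* referenced or already being freed - drop the lock and skip it */
		xfs_dqunlock(dqp);
		return EAGAIN;
	}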
> +++ xfs/fs/xfs/xfs_qm.c	2012-02-12 13:22:33.326936637 -0800
> @@ -308,172 +308,157 @@ xfs_qm_unmount_quotas(
>  }
>
>  /*
> - * Flush all dquots of the given file system to disk. The dquots are
> - * _not_ purged from memory here, just their data written to disk.
> + * The quota lookup is done in batches to keep the amount of lock traffic and
> + * radix tree lookups to a minimum. The batch size is a trade off between
> + * lookup reduction and stack usage.

Given the way the locking works here, the gang lookup doesn't really do
anything for reducing lock traffic. It reduces lookup overhead a bit, but
seeing as we don't drop the tree lock while executing operations on each
dquot, I don't see much advantage in the complexity of batched lookups....

>   */
> +#define XFS_DQ_LOOKUP_BATCH	32
> +
>  STATIC int
> -xfs_qm_dqflush_all(
> -	struct xfs_mount *mp)
> -{
> -	struct xfs_quotainfo *q = mp->m_quotainfo;
> -	int recl;
> -	struct xfs_dquot *dqp;
> -	int error;
> +xfs_qm_dquot_walk(
> +	struct xfs_mount *mp,
> +	int type,
> +	int (*execute)(struct xfs_dquot *dqp, int flags),
> +	int flags)
> +{
> +	struct radix_tree_root *tree = XFS_DQUOT_TREE(mp, type);
> +	uint32_t first_index;
> +	int last_error = 0;
> +	int skipped;
> +	int nr_found;
> +
> +restart:
> +	skipped = 0;
> +	first_index = 0;
> +	nr_found = 0;
>
> -	if (!q)
> -		return 0;
> -again:
> -	mutex_lock(&q->qi_dqlist_lock);
> -	list_for_each_entry(dqp, &q->qi_dqlist, q_mplist) {
> -		xfs_dqlock(dqp);
> -		if ((dqp->dq_flags & XFS_DQ_FREEING) ||
> -		    !XFS_DQ_IS_DIRTY(dqp)) {
> -			xfs_dqunlock(dqp);
> -			continue;
> -		}
> +	mutex_lock(&mp->m_quotainfo->qi_tree_lock);
> +	do {
> +		struct xfs_dquot *batch[XFS_DQ_LOOKUP_BATCH];
> +		int error = 0;
> +		int i;
> +
> +		nr_found = radix_tree_gang_lookup(tree, (void **)batch,
> +					first_index, XFS_DQ_LOOKUP_BATCH);
> +		if (!nr_found)
> +			break;
>
> -		/* XXX a sentinel would be better */
> -		recl = q->qi_dqreclaims;
> -		if (!xfs_dqflock_nowait(dqp)) {
> -			/*
> -			 * If we can't grab the flush lock then check
> -			 * to see if the dquot has been flushed delayed
> -			 * write. If so, grab its buffer and send it
> -			 * out immediately. We'll be able to acquire
> -			 * the flush lock when the I/O completes.
> -			 */
> -			xfs_dqflock_pushbuf_wait(dqp);
> +		for (i = 0; i < nr_found; i++) {
> +			struct xfs_dquot *dqp = batch[i];
> +
> +			first_index = be32_to_cpu(dqp->q_core.d_id) + 1;
> +
> +			error = execute(batch[i], flags);
> +			if (error == EAGAIN) {
> +				skipped++;
> +				continue;
> +			}
> +			if (error && last_error != EFSCORRUPTED)
> +				last_error = error;
> +		}
> +		/* bail out if the filesystem is corrupted. */
> +		if (error == EFSCORRUPTED) {
> +			skipped = 0;
> +			break;
>  		}

The problem I see with this is that it holds the qi_tree_lock over the
entire walk - it is not dropped anywhere if there is no reschedule
pressure. Hence all lookups will stall while a walk is in progress. Given
a walk can block on IO or dquot locks, this could mean that a walk holds
off lookups for quite some time.

> -		/*
> -		 * Let go of the mplist lock. We don't want to hold it
> -		 * across a disk write.
> -		 */
> -		mutex_unlock(&q->qi_dqlist_lock);
> -		error = xfs_qm_dqflush(dqp, 0);
> -		xfs_dqunlock(dqp);
> -		if (error)
> -			return error;
>
> -		mutex_lock(&q->qi_dqlist_lock);
> -		if (recl != q->qi_dqreclaims) {
> -			mutex_unlock(&q->qi_dqlist_lock);
> -			/* XXX restart limit */
> -			goto again;
> +		if (need_resched()) {
> +			mutex_unlock(&mp->m_quotainfo->qi_tree_lock);
> +			cond_resched();
> +			mutex_lock(&mp->m_quotainfo->qi_tree_lock);
>  		}

While this plays nice with other threads that require low latency, it
doesn't solve the "hold the lock for the entire walk" problem when lookups
are trying to get the qi tree lock, as need_resched state is only triggered
at the scheduler level and not by lock waiters.

The problem doesn't exist with the current code, because lookups are done
under the hash lock, not the list lock. Now both lookup and
all-dquot-walking functionality are under the same lock, so hold-offs
definitely need thinking about.

Do we need to hold the tree lock over the execute() function - I can see
the advantages for the purge case, but for the flush case it is less clear.
Perhaps unconditionally dropping the tree lock after every batch would
mitigate this problem - after all we already have the index we need to do
the next lookup from and that doesn't change if we drop the lock....
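Something like the following untested sketch is what I have in mind - it is
the same loop as the patch except that the tree lock is cycled
unconditionally at the end of every batch, and it assumes the restart and
skipped handling after the loop stays as it is in the patch:

	mutex_lock(&mp->m_quotainfo->qi_tree_lock);
	do {
		struct xfs_dquot *batch[XFS_DQ_LOOKUP_BATCH];
		int error = 0;
		int i;

		nr_found = radix_tree_gang_lookup(tree, (void **)batch,
					first_index, XFS_DQ_LOOKUP_BATCH);
		if (!nr_found)
			break;

		for (i = 0; i < nr_found; i++) {
			struct xfs_dquot *dqp = batch[i];

			first_index = be32_to_cpu(dqp->q_core.d_id) + 1;

			error = execute(batch[i], flags);
			if (error == EAGAIN) {
				skipped++;
				continue;
			}
			if (error && last_error != EFSCORRUPTED)
				last_error = error;
		}
		/* bail out if the filesystem is corrupted. */
		if (error == EFSCORRUPTED) {
			skipped = 0;
			break;
		}

		/*
		 * Give up the tree lock after every batch so that pending
		 * lookups are not held off for the length of the entire
		 * walk; first_index already carries all the state needed
		 * to pick up where we left off once we regain the lock.
		 */
		mutex_unlock(&mp->m_quotainfo->qi_tree_lock);
		cond_resched();
		mutex_lock(&mp->m_quotainfo->qi_tree_lock);
	} while (nr_found);
	mutex_unlock(&mp->m_quotainfo->qi_tree_lock);

That also means a blocking execute() function only holds off lookups for
the remainder of its own batch, not the whole tree walk.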
> +STATIC int
> +xfs_qm_flush_one(
> +	struct xfs_dquot *dqp,
> +	int flags)
>  {
> -	struct xfs_quotainfo *q = mp->m_quotainfo;
> -	struct xfs_dquot *dqp, *gdqp;
> +	int error = 0;
>
> - again:
> -	ASSERT(mutex_is_locked(&q->qi_dqlist_lock));
> -	list_for_each_entry(dqp, &q->qi_dqlist, q_mplist) {
> -		xfs_dqlock(dqp);
> -		if (dqp->dq_flags & XFS_DQ_FREEING) {
> -			xfs_dqunlock(dqp);
> -			mutex_unlock(&q->qi_dqlist_lock);
> -			delay(1);
> -			mutex_lock(&q->qi_dqlist_lock);
> -			goto again;
> -		}
> +	xfs_dqlock(dqp);
> +	if (dqp->dq_flags & XFS_DQ_FREEING)
> +		goto out_unlock;
> +	if (!XFS_DQ_IS_DIRTY(dqp))
> +		goto out_unlock;
>
> -		gdqp = dqp->q_gdquot;
> -		if (gdqp)
> -			dqp->q_gdquot = NULL;
> -		xfs_dqunlock(dqp);
> +	if (!xfs_dqflock_nowait(dqp))
> +		xfs_dqflock_pushbuf_wait(dqp);

For example, this blocks holding the tree lock waiting for IO completion.

> -xfs_qm_dqpurge_int(
> +xfs_qm_detach_gdquot(
> +	struct xfs_dquot *dqp,
> +	int flags)
> +{
> +	struct xfs_dquot *gdqp;
> +
> +	xfs_dqlock(dqp);
> +	/* XXX(hch): should we bother with freeeing dquots here? */
> +	if (dqp->dq_flags & XFS_DQ_FREEING) {
> +		xfs_dqunlock(dqp);
> +		return 0;
> +	}

Better to be safe, I think, rather than leave a landmine for future
modifications to trip over...

.....

> +	if (!error && (flags & XFS_QMOPT_UQUOTA))
> +		error = xfs_qm_dquot_walk(mp, XFS_DQ_USER, xfs_qm_dqpurge, 0);
> +	if (!error && (flags & XFS_QMOPT_GQUOTA))
> +		error = xfs_qm_dquot_walk(mp, XFS_DQ_GROUP, xfs_qm_dqpurge, 0);
> +	if (!error && (flags & XFS_QMOPT_PQUOTA))
> +		error = xfs_qm_dquot_walk(mp, XFS_DQ_PROJ, xfs_qm_dqpurge, 0);
> +	return error;

Seeing as it is a purge, even on an error I'd still try to purge all the
trees. Indeed, what happens in the case of a filesystem shutdown here?

> +	 * We've made all the changes that we need to make incore. Flush them
> +	 * down to disk buffers if everything was updated successfully.
>  	 */
> -	if (!error)
> -		error = xfs_qm_dqflush_all(mp);
> +	if (!error && XFS_IS_UQUOTA_ON(mp))
> +		error = xfs_qm_dquot_walk(mp, XFS_DQ_USER, xfs_qm_flush_one, 0);
> +	if (!error && XFS_IS_GQUOTA_ON(mp))
> +		error = xfs_qm_dquot_walk(mp, XFS_DQ_GROUP, xfs_qm_flush_one, 0);
> +	if (!error && XFS_IS_PQUOTA_ON(mp))
> +		error = xfs_qm_dquot_walk(mp, XFS_DQ_PROJ, xfs_qm_flush_one, 0);

Same here - I'd still try to flush each tree even if one tree gets an
error...

Hmmmm - all the walk cases pass 0 as their flags. Are they used in later
patches?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx