On Fri, Apr 07, 2017 at 01:07:25PM -0700, Darrick J. Wong wrote:
> On Fri, Apr 07, 2017 at 08:06:31AM -0400, Brian Foster wrote:
> > On Thu, Apr 06, 2017 at 03:34:17PM -0700, Darrick J. Wong wrote:
> > > On Fri, Feb 24, 2017 at 02:53:20PM -0500, Brian Foster wrote:
> > > > Reclaim during quotacheck can lead to deadlocks on the dquot
> > > > flush lock:
> > > >
> > > > - Quotacheck populates a local delwri queue with the physical
> > > >   dquot buffers.
> > > > - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and
> > > >   dirties all of the dquots.
> > > > - Reclaim kicks in and attempts to flush a dquot whose buffer is
> > > >   already queued on the quotacheck queue. The flush succeeds but
> > > >   queueing to the reclaim delwri queue fails as the backing
> > > >   buffer is already queued. The flush unlock is now deferred to
> > > >   I/O completion of the buffer from the quotacheck queue.
> > >
> > > <typing aloud here, trying to follow along...>
> > >
> > > If I'm reading this correctly, this is the _buf_delwri_queue call
> > > towards the end of the _DQ_IS_DIRTY clause in _qm_dquot_isolate?
> > > Once the dquot is flushed we fail to add it to isol->buffers
> > > because it's already on quotacheck's buffer_list, which means that
> > > the _buf_delwri_submit call in _qm_shrink_scan never sees our
> > > dirty dquot.
> > >
> > > Therefore, the dquot doesn't get written and the flush lock stays
> > > locked...
> > >
> >
> > Yes, exactly.
> >
> > > > - The dqadjust bulkstat continues and dirties the recently
> > > >   flushed dquot once again.
> > > > - Quotacheck proceeds to the xfs_qm_flush_one() walk, which
> > > >   requires the flush lock to update the backing buffers with the
> > > >   in-core recalculated values. It deadlocks on the redirtied
> > > >   dquot as the flush lock was already acquired by reclaim, but
> > > >   the buffer resides on the local delwri queue, which isn't
> > > >   submitted until the end of quotacheck.
> > >
> > > ...until quotacheck tries to relock it and deadlocks, like you
> > > said.  Ok, I think I understand the deadlock...
> > >
> >
> > Yep.
> >
> > > > This is reproduced by running quotacheck on a filesystem with a
> > > > couple million inodes in low memory (512MB-1GB) situations. This
> > > > is a regression as of commit 43ff2122e6 ("xfs: on-stack delayed
> > > > write buffer lists"), which removed a trylock and buffer I/O
> > > > submission from the quotacheck dquot flush sequence.
> > > >
> > > > Quotacheck first resets and collects the physical dquot buffers
> > > > in a delwri queue. Then it traverses the filesystem inodes via
> > > > bulkstat, updates the in-core dquots, flushes the corrected
> > > > dquots to the backing buffers and finally submits the delwri
> > > > queue for I/O. Since the backing buffers are queued across the
> > > > entire quotacheck operation, dquot reclaim cannot possibly
> > > > complete a dquot flush before quotacheck completes.
> > > >
> > > > Therefore, quotacheck must submit the buffer for I/O in order to
> > > > cycle the flush lock and flush the dirty in-core dquot to the
> > > > buffer. Add a delwri queue buffer push mechanism to submit an
> > > > individual buffer for I/O without losing the delwri queue
> > > > status, and use it from quotacheck to avoid the deadlock. This
> > > > restores quotacheck behavior to how it was before the regression
> > > > was introduced.
> > >
> > > ...and therefore why this patch changes quotacheck to check if the
> > > dquot's flush lock is already held and, if so, to call
> > > _buf_delwri_submit_buffers so that the I/O completion will unlock
> > > the flush lock and try again.
> > >
> >
> > Yes, though only for the specific buffer.
> >
> > > When we return to _qm_flush_one, the dquot is no longer dirty and
> > > we're done.  I /think/ this looks ok, though I also think I want
> > > to hear if either Christian Affolter or Martin Svec have had any
> > > problems with these two patches applied.  My current feeling is
> > > that this is a bit late for 4.12, but I'd certainly put it in
> > > -washme for wider testing.  If anyone thinks this needs to happen
> > > sooner, please speak up.
> > >
> >
> > The dquot is still dirty and needs to be flushed on the second go
> > around (otherwise we wouldn't have needed the flush lock here on the
> > first try). The issue is basically that the dquot was adjusted
> > (dirtied) at some point during the bulkstat scan, reclaimed and
> > flushed, and then subsequently dirtied again such that another flush
> > is required via qm_flush_one(). The buffer push frees up the held
> > flush lock so that the final flush can proceed.
>
> Ok.  I'd thought that the dquot could still be dirty when we got
> there.
>
> > BTW, what's the reasoning for this being late for 4.12? I actually
> > don't have much of an opinion on which release this goes into, so I
> > guess my question is more logistical, since up until recently I
> > haven't paid much attention to the upstream release cycles. ;P The
> > 4.12 merge window hasn't opened as we are still on 4.11-rc5, right?
> > Is the concern then that the merge window is close enough that this
> > has missed 4.12 linux-next soak time?
>
> We're still at least a week or two away from the merge window opening,
> but I want to hear from the two bug reporters that this patch series
> has made their symptoms go away, and that they've mounted enough times
> to be reasonably confident that they haven't simply gotten lucky and
> not hit the deadlock.  I don't know that both of those things will
> happen in the time we have left, and seeing as the frequency of
> reports is pretty low, it doesn't seem like we have to rush it into
> 4.12.
>
> > Also, can you clarify the semantics of the -washme thing? ;) That
> > doesn't necessarily mean the patch has been "merged." Rather, it
> > might be suitable for the next for-next (4.13) provided nothing
> > explodes by then, correct?
>
> Yep.
>
> > I'm also a little curious about what extra testing that entails, if
> > any, though perhaps this topic is better suited to an independent
> > thread..
>
> Same testing as for-next gets (nightly auto group runs, ltp every now
> and then, various $magic workloads); it's really just a place to land
> patches that have been code reviewed but I'm not yet confident enough
> about to put into for-next for the next cycle.
>

Ok, thanks. Note there will be another version incoming based on Dave's
feedback. Otherwise sounds good.

Brian

> --D
>
> >
> > > Sorry this one took me a while to get to. :/
> > >
> > > Reviewed-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > >
> >
> > No problem. Thanks for review...
> >
> > Brian
> >
> > > --D
> > >
> > > > Reported-by: Martin Svec <martin.svec@xxxxxxxx>
> > > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > > > ---
> > > >  fs/xfs/xfs_buf.c   | 37 +++++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/xfs_buf.h   |  1 +
> > > >  fs/xfs/xfs_qm.c    | 28 +++++++++++++++++++++++++++-
> > > >  fs/xfs/xfs_trace.h |  1 +
> > > >  4 files changed, 66 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > index e566510..e97cf56 100644
> > > > --- a/fs/xfs/xfs_buf.c
> > > > +++ b/fs/xfs/xfs_buf.c
> > > > @@ -2011,6 +2011,43 @@ xfs_buf_delwri_submit(
> > > >  	return error;
> > > >  }
> > > >
> > > > +int
> > > > +xfs_buf_delwri_pushbuf(
> > > > +	struct xfs_buf		*bp,
> > > > +	struct list_head	*buffer_list)
> > > > +{
> > > > +	LIST_HEAD		(submit_list);
> > > > +	int			error;
> > > > +
> > > > +	ASSERT(xfs_buf_islocked(bp));
> > > > +	ASSERT(bp->b_flags & _XBF_DELWRI_Q);
> > > > +
> > > > +	trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_);
> > > > +
> > > > +	/*
> > > > +	 * Move the buffer to an empty list and submit. Pass the original list
> > > > +	 * as the wait list so delwri submission moves the buf back to it before
> > > > +	 * it is submitted (and thus before it is unlocked). This means the
> > > > +	 * buffer cannot be placed on another list while we wait for it.
> > > > +	 */
> > > > +	list_move(&bp->b_list, &submit_list);
> > > > +	xfs_buf_unlock(bp);
> > > > +
> > > > +	xfs_buf_delwri_submit_buffers(&submit_list, buffer_list);
> > > > +
> > > > +	/*
> > > > +	 * Lock the buffer to wait for I/O completion. It's already held on the
> > > > +	 * original list, so all we have to do is reset the delwri queue flag
> > > > +	 * that was cleared by delwri submission.
> > > > +	 */
> > > > +	xfs_buf_lock(bp);
> > > > +	error = bp->b_error;
> > > > +	bp->b_flags |= _XBF_DELWRI_Q;
> > > > +	xfs_buf_unlock(bp);
> > > > +
> > > > +	return error;
> > > > +}
> > > > +
> > > >  int __init
> > > >  xfs_buf_init(void)
> > > >  {
> > > > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > > > index 8a9d3a9..cd74216 100644
> > > > --- a/fs/xfs/xfs_buf.h
> > > > +++ b/fs/xfs/xfs_buf.h
> > > > @@ -334,6 +334,7 @@ extern void xfs_buf_stale(struct xfs_buf *bp);
> > > >  extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *);
> > > >  extern int xfs_buf_delwri_submit(struct list_head *);
> > > >  extern int xfs_buf_delwri_submit_nowait(struct list_head *);
> > > > +extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *);
> > > >
> > > >  /* Buffer Daemon Setup Routines */
> > > >  extern int xfs_buf_init(void);
> > > > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> > > > index 4ff993c..3815ed3 100644
> > > > --- a/fs/xfs/xfs_qm.c
> > > > +++ b/fs/xfs/xfs_qm.c
> > > > @@ -1247,6 +1247,7 @@ xfs_qm_flush_one(
> > > >  	struct xfs_dquot	*dqp,
> > > >  	void			*data)
> > > >  {
> > > > +	struct xfs_mount	*mp = dqp->q_mount;
> > > >  	struct list_head	*buffer_list = data;
> > > >  	struct xfs_buf		*bp = NULL;
> > > >  	int			error = 0;
> > > > @@ -1257,7 +1258,32 @@ xfs_qm_flush_one(
> > > >  	if (!XFS_DQ_IS_DIRTY(dqp))
> > > >  		goto out_unlock;
> > > >
> > > > -	xfs_dqflock(dqp);
> > > > +	/*
> > > > +	 * The only way the dquot is already flush locked by the time quotacheck
> > > > +	 * gets here is if reclaim flushed it before the dqadjust walk dirtied
> > > > +	 * it for the final time. Quotacheck collects all dquot bufs in the
> > > > +	 * local delwri queue before dquots are dirtied, so reclaim can't have
> > > > +	 * possibly queued it for I/O. The only way out is to push the buffer to
> > > > +	 * cycle the flush lock.
> > > > +	 */
> > > > +	if (!xfs_dqflock_nowait(dqp)) {
> > > > +		/* buf is pinned in-core by delwri list */
> > > > +		DEFINE_SINGLE_BUF_MAP(map, dqp->q_blkno,
> > > > +				      mp->m_quotainfo->qi_dqchunklen);
> > > > +		bp = _xfs_buf_find(mp->m_ddev_targp, &map, 1, 0, NULL);
> > > > +		if (!bp) {
> > > > +			error = -EINVAL;
> > > > +			goto out_unlock;
> > > > +		}
> > > > +
> > > > +		/* delwri_pushbuf drops the buf lock */
> > > > +		xfs_buf_delwri_pushbuf(bp, buffer_list);
> > > > +		xfs_buf_rele(bp);
> > > > +
> > > > +		error = -EAGAIN;
> > > > +		goto out_unlock;
> > > > +	}
> > > > +
> > > >  	error = xfs_qm_dqflush(dqp, &bp);
> > > >  	if (error)
> > > >  		goto out_unlock;
> > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > > index fb7555e..c8d28a9 100644
> > > > --- a/fs/xfs/xfs_trace.h
> > > > +++ b/fs/xfs/xfs_trace.h
> > > > @@ -365,6 +365,7 @@ DEFINE_BUF_EVENT(xfs_buf_iowait_done);
> > > >  DEFINE_BUF_EVENT(xfs_buf_delwri_queue);
> > > >  DEFINE_BUF_EVENT(xfs_buf_delwri_queued);
> > > >  DEFINE_BUF_EVENT(xfs_buf_delwri_split);
> > > > +DEFINE_BUF_EVENT(xfs_buf_delwri_pushbuf);
> > > >  DEFINE_BUF_EVENT(xfs_buf_get_uncached);
> > > >  DEFINE_BUF_EVENT(xfs_buf_item_relse);
> > > >  DEFINE_BUF_EVENT(xfs_buf_item_iodone_async);
> > > > --
> > > > 2.7.4
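
For readers following the thread, the ordering at the heart of the deadlock
can be condensed into a short sketch. This is not the real xfs_qm_quotacheck()
body; the quotacheck_outline() name and its structure are hypothetical and
only xfs_buf_delwri_submit() is a real call (declared in fs/xfs/xfs_buf.h
above). It only illustrates why the delwri queue held across the whole run
means quotacheck, not reclaim, has to be the one to cycle the flush lock.

/*
 * Hypothetical sketch only -- not the real xfs_qm_quotacheck().  It
 * condenses the ordering described in the commit message above; the
 * sole real call here is xfs_buf_delwri_submit().
 */
STATIC int
quotacheck_outline(
	struct xfs_mount	*mp)
{
	LIST_HEAD		(buffer_list);	/* held across the entire run */
	int			error;

	/*
	 * 1. Reset the on-disk dquot buffers and queue them all on
	 *    buffer_list.  From here on reclaim may flush a dirty dquot,
	 *    but it cannot queue the backing buffer for I/O because the
	 *    buffer already sits on this (unsubmitted) delwri queue, so
	 *    the flush unlock is deferred until step 4.
	 *
	 * 2. Bulkstat walk (xfs_qm_dqusage_adjust()) dirties the in-core
	 *    dquots.  A dquot that reclaim flushed in the meantime can be
	 *    re-dirtied here while its flush lock is still held.
	 *
	 * 3. xfs_qm_flush_one() walk: each dirty dquot needs the flush
	 *    lock.  Before the patch this waited forever on a dquot from
	 *    step 2; with it, a failed trylock calls
	 *    xfs_buf_delwri_pushbuf() so I/O completion cycles the flush
	 *    lock, and -EAGAIN lets the walk retry the dquot.
	 */

	/* 4. Only now is the queue collected in step 1 submitted. */
	error = xfs_buf_delwri_submit(&buffer_list);
	return error;
}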