On Mon, Jun 18, 2018 at 10:21:22PM -0700, Darrick J. Wong wrote: > On Fri, Jun 15, 2018 at 02:05:36PM -0400, Brian Foster wrote: > > If a delwri queue occurs of a buffer that sits on a delwri queue > > wait list, the queue sets _XBF_DELWRI_Q without changing the state > > of ->b_list. This occurs, for example, if another thread beats the > > current delwri waiter thread to the buffer lock after I/O > > completion. Once the waiter acquires the lock, it removes the buffer > > from the wait list and leaves a buffer with _XBF_DELWRI_Q set but > > not populated on a list. This results in a lost buffer submission > > and in turn can result in assert failures due to _XBF_DELWRI_Q being > > set on buffer reclaim or filesystem lockups if the buffer happens to > > cover an item in the AIL. > > > > This problem has been reproduced by repeated iterations of xfs/305 > > on high CPU count (28xcpu) systems with limited memory (~1GB). Dirty > > dquot reclaim races with an xfsaild push of a separate dquot backed > > by the same buffer such that the buffer sits on the reclaim wait > > list at the time xfsaild attempts to queue it. Since the latter > > dquot has been flush locked but the underlying buffer not submitted > > for I/O, the dquot pins the AIL and causes the filesystem to > > livelock. > > > > This race is essentially made possible by the buffer lock cycle > > involved with waiting on a synchronous delwri queue submission. > > Close the race by using synchronous buffer I/O for respective delwri > > queue submission. This means the buffer remains locked across the > > I/O and so is inaccessible from other contexts while in the > > intermediate wait list state. The sync buffer I/O wait mechanism is > > factored into a helper such that sync delwri buffer submission and > > serialization are batched operations. > > > > Designed-by: Dave Chinner <dchinner@xxxxxxxxxx> > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx> > > --- > > fs/xfs/xfs_buf.c | 78 +++++++++++++++++++++++++----------------------- > > 1 file changed, 41 insertions(+), 37 deletions(-) > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > > index 7b0f7c79cd62..b5c0f6949b30 100644 > > --- a/fs/xfs/xfs_buf.c > > +++ b/fs/xfs/xfs_buf.c > > @@ -1531,6 +1531,20 @@ xfs_buf_submit( > > xfs_buf_rele(bp); > > } > > > > +/* > > + * Wait for I/O completion of a sync buffer and return the I/O error code. > > + */ > > +static int > > +xfs_buf_iowait( > > + struct xfs_buf *bp) > > +{ > > + trace_xfs_buf_iowait(bp, _RET_IP_); > > + wait_for_completion(&bp->b_iowait); > > + trace_xfs_buf_iowait_done(bp, _RET_IP_); > > + > > + return bp->b_error; > > +} > > + > > /* > > * Synchronous buffer IO submission path, read or write. > > */ > > @@ -1553,12 +1567,7 @@ xfs_buf_submit_wait( > > error = __xfs_buf_submit(bp); > > if (error) > > goto out; > > - > > - /* wait for completion before gathering the error from the buffer */ > > - trace_xfs_buf_iowait(bp, _RET_IP_); > > - wait_for_completion(&bp->b_iowait); > > - trace_xfs_buf_iowait_done(bp, _RET_IP_); > > - error = bp->b_error; > > + error = xfs_buf_iowait(bp); > > > > out: > > /* > > @@ -1961,16 +1970,11 @@ xfs_buf_cmp( > > } > > > > /* > > - * submit buffers for write. > > - * > > - * When we have a large buffer list, we do not want to hold all the buffers > > - * locked while we block on the request queue waiting for IO dispatch. To avoid > > - * this problem, we lock and submit buffers in groups of 50, thereby minimising > > - * the lock hold times for lists which may contain thousands of objects. > > - * > > - * To do this, we sort the buffer list before we walk the list to lock and > > - * submit buffers, and we plug and unplug around each group of buffers we > > - * submit. > > + * Submit buffers for write. If wait_list is specified, the buffers are > > + * submitted using sync I/O and placed on the wait list such that the caller can > > + * iowait each buffer. Otherwise async I/O is used and the buffers are released > > + * at I/O completion time. In either case, buffers remain locked until I/O > > + * completes and the buffer is released from the queue. > > */ > > static int > > xfs_buf_delwri_submit_buffers( > > @@ -2012,21 +2016,22 @@ xfs_buf_delwri_submit_buffers( > > trace_xfs_buf_delwri_split(bp, _RET_IP_); > > > > /* > > - * We do all IO submission async. This means if we need > > - * to wait for IO completion we need to take an extra > > - * reference so the buffer is still valid on the other > > - * side. We need to move the buffer onto the io_list > > - * at this point so the caller can still access it. > > + * If we have a wait list, each buffer (and associated delwri > > + * queue reference) transfers to it and is submitted > > + * synchronously. Otherwise, drop the buffer from the delwri > > + * queue and submit async. > > */ > > bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_WRITE_FAIL); > > - bp->b_flags |= XBF_WRITE | XBF_ASYNC; > > + bp->b_flags |= XBF_WRITE; > > if (wait_list) { > > - xfs_buf_hold(bp); > > + bp->b_flags &= ~XBF_ASYNC; > > list_move_tail(&bp->b_list, wait_list); > > - } else > > + __xfs_buf_submit(bp); > > + } else { > > + bp->b_flags |= XBF_ASYNC; > > list_del_init(&bp->b_list); > > - > > - xfs_buf_submit(bp); > > + xfs_buf_submit(bp); > > + } > > } > > blk_finish_plug(&plug); > > > > @@ -2073,8 +2078,11 @@ xfs_buf_delwri_submit( > > > > list_del_init(&bp->b_list); > > > > - /* locking the buffer will wait for async IO completion. */ > > - xfs_buf_lock(bp); > > + /* > > + * Wait on the locked buffer, check for errors and unlock and > > + * release the delwri queue reference. > > + */ > > + xfs_buf_iowait(bp); > > error2 = bp->b_error; > > error2 = xfs_buf_iowait()? > > > xfs_buf_relse(bp); > > if (!error) > > @@ -2121,22 +2129,18 @@ xfs_buf_delwri_pushbuf( > > > > /* > > * Delwri submission clears the DELWRI_Q buffer flag and returns with > > - * the buffer on the wait list with an associated reference. Rather than > > + * the buffer on the wait list with the original reference. Rather than > > * bounce the buffer from a local wait list back to the original list > > * after I/O completion, reuse the original list as the wait list. > > */ > > xfs_buf_delwri_submit_buffers(&submit_list, buffer_list); > > > > /* > > - * The buffer is now under I/O and wait listed as during typical delwri > > - * submission. Lock the buffer to wait for I/O completion. Rather than > > - * remove the buffer from the wait list and release the reference, we > > - * want to return with the buffer queued to the original list. The > > - * buffer already sits on the original list with a wait list reference, > > - * however. If we let the queue inherit that wait list reference, all we > > - * need to do is reset the DELWRI_Q flag. > > + * The buffer is now locked, under I/O and wait listed on the original > > + * delwri queue. Wait for I/O completion, restore the DELWRI_Q flag and > > + * return with the buffer unlocked and on the original queue. > > */ > > - xfs_buf_lock(bp); > > + xfs_buf_iowait(bp); > > error = bp->b_error; > > error = xfs_buf_iowait()? > > Or is there a specific reason you collected the error separately? > Nope, just an oversight. Will fix.. Brian > Looks reasonable and I could follow along this time. :) > > --D > > > bp->b_flags |= _XBF_DELWRI_Q; > > xfs_buf_unlock(bp); > > -- > > 2.17.1 > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html