[PATCH 2/2] xfs: allow delwri requeue of wait listed buffers

If a delwri queue is attempted on a buffer that was previously
submitted from a delwri queue but has not yet been removed from the
wait list, the queue sets _XBF_DELWRI_Q without changing the state of
->b_list.
This occurs, for example, if another thread beats the submitter
thread to the buffer lock after I/O completion. Once the submitter
acquires the lock, it removes the buffer from the wait list and
leaves a buffer with _XBF_DELWRI_Q set but not populated on a list.
This results in a lost buffer submission and in turn can result in
assert failures due to _XBF_DELWRI_Q being set on buffer reclaim or
filesystem lockups if the buffer happens to cover an item in the
AIL.

This problem has been reproduced by repeated iterations of xfs/305
on high CPU count (28xcpu) systems with limited memory (~1GB). Dirty
dquot reclaim races with an xfsaild push of a separate dquot backed
by the same buffer such that the buffer sits on the reclaim wait
list at the time xfsaild attempts to queue it. Since the latter
dquot has been flush locked but the underlying buffer not submitted
for I/O, the dquot pins the AIL and causes the filesystem to
livelock.

To address this problem, allow a delwri queue of a wait listed
buffer to steal the buffer from the wait list and add it to the
associated delwri queue. This is fundamentally safe because the
purpose of the wait list is to provide synchronous I/O. The buffer
lock of each wait listed buffer is cycled to ensure that I/O has
completed. If another thread acquires the buffer lock first, then
the I/O has completed and the submitter lock cycle is a formality.

The tradeoff to this approach is that the submitter loses the
ability to check error state of stolen buffers. This is technically
already possible as once the lock is released there is no guarantee
another thread has not interfered with the buffer error state by the
time the submitter reacquires the lock. Further, most critical error
handling occurs in the iodone callbacks in completion context of the
specific buffer since the delwri submitter has no idea which buffer
failed in the first place. Finally, the stolen buffer case should be
relatively rare and limited to races under the highly parallel
and low memory conditions described above.

Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
---
 fs/xfs/xfs_buf.c | 77 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 50 insertions(+), 27 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 912cbbf44db7..ca0520b173c6 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1917,6 +1917,8 @@ xfs_buf_delwri_queue(
 	struct xfs_buf		*bp,
 	struct list_head	*list)
 {
+	bool			wait = (bp->b_flags & _XBF_DELWRI_W);
+
 	ASSERT(xfs_buf_islocked(bp));
 	ASSERT(!(bp->b_flags & XBF_READ));
 
@@ -1926,23 +1928,37 @@ xfs_buf_delwri_queue(
 	 * case.
 	 */
 	if (bp->b_flags & _XBF_DELWRI_Q) {
+		ASSERT(!wait);
 		trace_xfs_buf_delwri_queued(bp, _RET_IP_);
 		return false;
 	}
 
 	trace_xfs_buf_delwri_queue(bp, _RET_IP_);
 
+	/*
+	 * If the buffer is on a wait list, the I/O has completed and the
+	 * current thread beat the waiter to the lock. Drop the buffer from the
+	 * wait list and requeue it. The waiter will see that DELWRI_W has been
+	 * cleared and skip the wait list removal. The wait list reference is
+	 * reused for the delwri queue.
+	 */
+	if (wait) {
+		bp->b_flags &= ~_XBF_DELWRI_W;
+		list_del_init(&bp->b_list);
+	}
+
 	/*
 	 * If a buffer gets written out synchronously or marked stale while it
 	 * is on a delwri list we lazily remove it. To do this, the other party
 	 * clears the  _XBF_DELWRI_Q flag but otherwise leaves the buffer alone.
-	 * It remains referenced and on the list.  In a rare corner case it
-	 * might get readded to a delwri list after the synchronous writeout, in
-	 * which case we need just need to re-add the flag here.
+	 * It remains referenced and on the list. In a rare corner case it might
+	 * get readded to a delwri list after the synchronous writeout, in which
+	 * case we just need to re-add the flag here.
 	 */
 	bp->b_flags |= _XBF_DELWRI_Q;
 	if (list_empty(&bp->b_list)) {
-		atomic_inc(&bp->b_hold);
+		if (!wait)
+			atomic_inc(&bp->b_hold);
 		list_add_tail(&bp->b_list, list);
 	}
 
@@ -2082,14 +2098,27 @@ xfs_buf_delwri_submit(
 
 	/* Wait for IO to complete. */
 	while (!list_empty(&wait_list)) {
+		bool release = false;
+
 		bp = list_first_entry(&wait_list, struct xfs_buf, b_list);
 
-		/* locking the buffer will wait for async IO completion. */
+		/*
+		 * Lock the buffer to wait for async I/O completion. Note that
+		 * somebody else could have requeued the buffer in the meantime.
+		 * That means some buffers might drop from the wait list, but
+		 * that requires the buffer lock and thus means the I/O had
+		 * completed first.
+		 */
 		xfs_buf_lock(bp);
-		bp->b_flags &= ~_XBF_DELWRI_W;
-		list_del_init(&bp->b_list);
+		if (bp->b_flags & _XBF_DELWRI_W) {
+			bp->b_flags &= ~_XBF_DELWRI_W;
+			list_del_init(&bp->b_list);
+			release = true;
+		}
 		error2 = bp->b_error;
-		xfs_buf_relse(bp);
+		xfs_buf_unlock(bp);
+		if (release)
+			xfs_buf_rele(bp);
 		if (!error)
 			error = error2;
 	}
@@ -2120,38 +2149,32 @@ xfs_buf_delwri_pushbuf(
 	LIST_HEAD		(submit_list);
 	int			error;
 
-	ASSERT(bp->b_flags & _XBF_DELWRI_Q);
-
 	trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_);
 
 	/*
-	 * Isolate the buffer to a new local list so we can submit it for I/O
-	 * independently from the rest of the original list.
+	 * Make sure the buffer sits on a queue and isolate it to a new local
+	 * list so we can submit it independently from the original list.
 	 */
 	xfs_buf_lock(bp);
+	if (!(bp->b_flags & _XBF_DELWRI_Q) || (bp->b_flags & _XBF_DELWRI_W)) {
+		xfs_buf_unlock(bp);
+		return 0;
+	}
 	list_move(&bp->b_list, &submit_list);
 	xfs_buf_unlock(bp);
 
 	/*
-	 * Delwri submission clears the DELWRI_Q buffer flag and returns with
-	 * the buffer on the wait list with an associated reference. Rather than
-	 * bounce the buffer from a local wait list back to the original list
-	 * after I/O completion, reuse the original list as the wait list.
+	 * Delwri submission moves the buffer over to the wait list. Reuse the
+	 * original list so there's no need to declare a local wait list.
+	 *
+	 * The buffer most likely still resides on the wait list by the time the
+	 * lock is reacquired. Requeue it just in case somebody else beat us to
+	 * the lock and changed the state.
 	 */
 	xfs_buf_delwri_submit_buffers(&submit_list, buffer_list);
-
-	/*
-	 * The buffer is now under I/O and wait listed as during typical delwri
-	 * submission. Lock the buffer to wait for I/O completion. Rather than
-	 * remove the buffer from the wait list and release the reference, we
-	 * want to return with the buffer queued to the original list. The
-	 * buffer already sits on the original list with a wait list reference,
-	 * however. If we let the queue inherit that wait list reference, all we
-	 * need to do is reset the DELWRI_Q flag.
-	 */
 	xfs_buf_lock(bp);
 	error = bp->b_error;
-	bp->b_flags |= _XBF_DELWRI_Q;
+	xfs_buf_delwri_queue(bp, buffer_list);
 	xfs_buf_unlock(bp);
 
 	return error;
-- 
2.17.1
