[PATCH] xfs: close xfs_wait_buftarg() race with buffer lookups

xfs_wait_buftarg() is invoked on unmount and filesystem freeze to drain and
free the buffer LRU. We have reports of filesystem freeze hangs with the
following call chain:

  ...
  xfs_fs_freeze()
    ->xfs_quiesce_attr()
      ->xfs_log_quiesce()
        ->xfs_wait_buftarg()

This hang can be reproduced with a long enough running fsstress instance
in parallel with a tight freeze/unfreeze loop. The cause of the hang is
racing b_hold updates between xfs_wait_buftarg() and _xfs_buf_find().
Specifically, the buftarg wait path checks whether a buffer has a b_hold
count > 1 to determine whether to skip the buffer. If b_hold == 1,
xfs_buftarg_wait_rele() prepares the buffer for final removal and
ultimately calls xfs_buf_rele() to drop the LRU reference.
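
For reference, the pre-patch logic in xfs_buftarg_wait_rele() looks
roughly like this (simplified from the lines removed in the diff below):

	if (atomic_read(&bp->b_hold) > 1) {
		/* someone else holds the buffer; skip it this pass */
		return LRU_SKIP;
	}
	/* window: _xfs_buf_find() can take a new b_hold reference here */
	if (!spin_trylock(&bp->b_lock))
		return LRU_SKIP;
	/* clear the LRU reference so xfs_buf_rele() won't ignore the buffer */
	atomic_set(&bp->b_lru_ref, 0);
	bp->b_state |= XFS_BSTATE_DISPOSE;
	list_lru_isolate_move(lru, item, dispose);
	spin_unlock(&bp->b_lock);
	return LRU_REMOVED;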

The problem is that _xfs_buf_find() can acquire a b_hold reference any
time after xfs_buftarg_wait_rele() has decided it has the only remaining
reference. If this occurs, xfs_wait_buftarg() drops the LRU reference,
but the xfs_buf_rele() instance doesn't actually remove the buffer from
the LRU due to the _xfs_buf_find() hold. At this point b_hold == 1, yet
the buffer is held via the _xfs_buf_find() codepath and still remains on
the LRU. Both call chains ultimately call xfs_buf_rele() on a buffer
with b_hold == 1, so each drops what it believes is the final reference.
This can have several user-visible side effects, such as use-after-free
errors or the freed buffer remaining on the LRU indefinitely with an
underflowed or garbage b_hold count.
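
Roughly, the interleaving looks like this (simplified; exact behavior
depends on the xfs_buf_rele() internals):

	xfs_buftarg_wait_rele()              _xfs_buf_find()
	-----------------------              ---------------
	b_hold == 1, isolate buffer
	                                     atomic_inc(&bp->b_hold)
	                                     (b_hold == 2)
	xfs_buf_rele()
	(b_hold == 1, buffer is not
	 freed and stays on the LRU)
	                                     xfs_buf_rele()
	                                     (b_hold == 0, buffer freed
	                                      while still on the LRU)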

Update xfs_buftarg_wait_rele() to properly synchronize the LRU drain
against buffer lookups. Atomically decrement and lock the perag
associated with the buffer to lock out buffer lookups before
xfs_wait_buftarg() determines whether the LRU holds the final reference.
Open-code freeing of the buffer so we can remove it from the perag
rbtree before the perag lock is dropped, guaranteeing it cannot be
looked up once freeing is imminent. Also update xfs_buftarg_wait_rele()
to drop and reacquire the lru_lock in the correct order to avoid
deadlocks with LRU insertion.
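
For reference, atomic_dec_and_lock() decrements the counter and returns
true, with the spinlock held, only when the count reaches zero. The
general last-reference pattern (with made-up obj/tree names, not the
actual XFS code) looks like:

	if (!atomic_dec_and_lock(&obj->refcount, &tree_lock))
		return;		/* not the last reference */

	/* last reference: tree_lock is held, so new lookups are locked out */
	rb_erase(&obj->node, &tree_root);
	spin_unlock(&tree_lock);
	free_obj(obj);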

Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
---

A couple of notes... first, there are several different ways to close
this race. I opted for this approach because xfs_wait_buftarg() is the
slow/less frequent path.

Second, I believe we still have an independent freeze hang upstream due
to an issue with how we use drain_workqueue(). Specifically, I see the
following warning from kernel/workqueue.c:__queue_work():

	if (unlikely(wq->flags & __WQ_DRAINING) &&
	    WARN_ON_ONCE(!is_chained_work(wq)))
		return;

... followed by a hang. I suspect this is due to the ioend work being
dropped and thus never completing the I/O, but I haven't dug into it
yet. For now, I'm testing for hangs with this fix applied and the above
check commented out.

Brian

 fs/xfs/xfs_buf.c | 52 ++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 42 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 607cc29..ce9c419 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1570,24 +1570,48 @@ xfs_buftarg_wait_rele(
 {
 	struct xfs_buf		*bp = container_of(item, struct xfs_buf, b_lru);
 	struct list_head	*dispose = arg;
+	struct xfs_perag	*pag = bp->b_pag;
+	int			release;
+
+	/*
+	 * Drop lru_lock to avoid deadlocks with LRU insertion. The required
+	 * lock order to safely isolate a buffer is pag_buf_lock -> b_lock
+	 * -> lru_lock.
+	 */
+	spin_unlock(lru_lock);
 
-	if (atomic_read(&bp->b_hold) > 1) {
+	/*
+	 * Atomically decrement and lock the perag to synchronize against buffer
+	 * lookups. We unconditionally decrement because a check for b_hold == 1
+	 * alone is not sufficient. _xfs_buf_find() can acquire a reference
+	 * immediately after the check.
+	 */
+	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
+	if (!release) {
 		/* need to wait, so skip it this pass */
 		trace_xfs_buf_wait_buftarg(bp, _RET_IP_);
-		return LRU_SKIP;
+		goto skip_buf;
 	}
-	if (!spin_trylock(&bp->b_lock))
-		return LRU_SKIP;
+	if (!spin_trylock(&bp->b_lock)) {
+		spin_unlock(&pag->pag_buf_lock);
+		goto skip_buf;
+	}
+	spin_lock(lru_lock);
+
+	/* remove from the rbtree to prevent further lookups */
+	rb_erase(&bp->b_rbnode, &pag->pag_buf_tree);
+	spin_unlock(&pag->pag_buf_lock);
 
-	/*
-	 * clear the LRU reference count so the buffer doesn't get
-	 * ignored in xfs_buf_rele().
-	 */
-	atomic_set(&bp->b_lru_ref, 0);
 	bp->b_state |= XFS_BSTATE_DISPOSE;
 	list_lru_isolate_move(lru, item, dispose);
 	spin_unlock(&bp->b_lock);
 	return LRU_REMOVED;
+
+skip_buf:
+	/* skipping the buf, put back the LRU hold */
+	atomic_inc(&bp->b_hold);
+	spin_lock(lru_lock);
+	return LRU_SKIP;
 }
 
 void
@@ -1629,7 +1653,15 @@ xfs_wait_buftarg(
 				xfs_alert(btp->bt_mount,
 "Please run xfs_repair to determine the extent of the problem.");
 			}
-			xfs_buf_rele(bp);
+
+			/*
+			 * The buffer has already been removed from the rbtree
+			 * and had the last reference dropped. Drop the perag
+			 * reference and free it.
+			 */
+			ASSERT(atomic_read(&bp->b_hold) == 0);
+			xfs_perag_put(bp->b_pag);
+			xfs_buf_free(bp);
 		}
 		if (loop++ != 0)
 			delay(100);
-- 
2.5.5
