On Wed, Nov 24, 2010 at 02:15:24PM +1100, Dave Chinner wrote:
> On Wed, Nov 24, 2010 at 11:50:03AM +1100, Nick Piggin wrote:
> > On Wed, Nov 24, 2010 at 07:58:04AM +1100, Dave Chinner wrote:
> > > On Tue, Nov 23, 2010 at 11:24:49PM +1100, Nick Piggin wrote:
> > > > Hi,
> > > > 
> > > > Running parallel fs_mark (0 size inodes, fsync on close) on a ramdisk
> > > > ends up with XFS in funny patterns.
> > > > 
> > > > procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> > > >  r  b  swpd   free   buff  cache   si   so    bi     bo    in     cs us sy id wa
> > > > 24  1  6576 166396    252 393676  132  140 16900  80666 21308 104333  1 84 14  1
> > > > 21  0  6712 433856    256 387080  100  224  9152  53487 13677  53732  0 55 45  0
> > > >  2  0  7068 463496    248 389100    0  364  2940  17896  4485  26122  0 33 65  2
> > > >  1  0  7068 464340    248 388928    0    0     0      0    66    207  0  0 100 0
> > > >  0  0  7068 464340    248 388928    0    0     0      0    79    200  0  0 100 0
> > > >  0  0  7068 464544    248 388928    0    0     0      0    65    199  0  0 100 0
> > > >  1  0  7068 464748    248 388928    0    0     0      0    79    201  0  0 100 0
> > > >  0  0  7068 465064    248 388928    0    0     0      0    66    202  0  0 100 0
> > > >  0  0  7068 465312    248 388928    0    0     0      0    80    200  0  0 100 0
> > > >  0  0  7068 465500    248 388928    0    0     0      0    65    199  0  0 100 0
> > > >  0  0  7068 465500    248 388928    0    0     0      0    80    202  0  0 100 0
> > > >  1  0  7068 465500    248 388928    0    0     0      0    66    203  0  0 100 0
> > > >  0  0  7068 465500    248 388928    0    0     0      0    79    200  0  0 100 0
> > > > 23  0  7068 460332    248 388800    0    0  1416   8896  1981   7142  0  1 99  0
> > > >  6  0  6968 360248    248 403736   56    0 15568  95171 19438 110825  1 79 21  0
> > > > 23  0  6904 248736    248 419704  392    0 17412 118270 20208 111396  1 82 17  0
> > > >  9  0  6884 266116    248 435904  128    0 14956  79756 18554 118020  1 76 23  0
> > > >  0  0  6848 219640    248 445760  212    0  9932  51572 12622  76491  0 60 40  0
> > > > 
> > > > Got a dump of sleeping tasks. Any ideas?
> > > 
> > > It is stuck waiting for log space to be freed up.
> > > Generally this is caused by log IO completion not occurring or an
> > > unflushable object preventing the tail from being moved forward. What:
> > 
> > Yeah it's strange, it seems like it hits some timeout or gets kicked
> > along by background writeback or something. Missed wakeup somewhere?
> 
> No idea yet. It's a maze of twisty passages.

Essentially, though, the problem is that the inode (or inode buffer) at
the tail of the log cannot be flushed because the inode buffer is stale,
pinned and locked. Hence it cannot be flushed by the xfsaild (can't lock
the buffer) and everything stops because there is no log space
available. There's a burst of activity every 30s because the xfssyncd
forces the log, which causes unpinning, unlocking and freeing of the
troublesome buffer, which then frees up some log space. Unfortunately,
once in this state, there's a good chance it will immediately get stuck
on another buffer as soon as the load ramps up again.

I think the problem is in _xfs_buf_find() where we do the trylock
operation on the buffer. In xfs_buf_lock() (where we block on the lock)
we do this:

	if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE))
		xfs_log_force(bp->b_target->bt_mount, 0);

That is, if the buffer is pinned and stale, we do an async log force
before going to sleep on the lock. This prevents having to wait for a
long time for the buffer to become unlocked and was introduced in
commit ed3b4d6c ("xfs: Improve scalability of busy extent tracking").
The reason was:

	The only problem with this approach is that when a metadata
	buffer is marked stale (e.g. a directory block is removed), then
	the buffer remains pinned and locked until the log goes to disk.
	The issue here is that if that stale buffer is reallocated in a
	subsequent transaction, the attempt to lock that buffer in the
	transaction will hang waiting for the log to go to disk to
	unlock and unpin the buffer.
Hence if someone tries to lock a pinned, stale, locked buffer we need
to push on the log to get it unlocked ASAP. Effectively we are trading
off a guaranteed log force for a much less common trigger for a log
force to occur.

What we are seeing here is exactly the same problem, only it hasn't
been noticed because we are doing a trylock operation and most of the
time failure to obtain the lock does not matter. However, it certainly
appears that we also need to trigger a log force if the trylock fails
on a pinned, stale buffer. The patch below should fix the problem.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

xfs: push stale, pinned buffers on trylock failures

From: Dave Chinner <dchinner@xxxxxxxxxx>

As reported by Nick Piggin, XFS is suffering from long pauses under
highly concurrent workloads when hosted on ramdisks. The problem is
that an inode buffer is stuck in the pinned state in memory and as a
result either the inode buffer or one of the inodes within the buffer
is stopping the tail of the log from being moved forward. The system
remains in this state until a periodic log force issued by xfssyncd
causes the buffer to be unpinned.

The main problem is that these are stale buffers, and are hence held
locked until the transaction/checkpoint that marked them stale has
been committed to disk. When the filesystem gets into this state, only
the xfssyncd can cause the async transactions to be committed to disk
and hence unpin the inode buffer.

This problem was encountered when scaling the busy extent list, but
only the blocking lock interface was fixed to solve the problem.
Extend the same fix to the buffer trylock operations - if we fail to
lock a pinned, stale buffer, then force the log immediately so that
when the next attempt to lock it comes around, it will have been
unpinned.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
---
 fs/xfs/linux-2.6/xfs_buf.c |   36 +++++++++++++++++++-----------------
 1 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index aa1d353..94e69a7 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -488,29 +488,21 @@ found:
 	spin_unlock(&pag->pag_buf_lock);
 	xfs_perag_put(pag);
 
-	/* Attempt to get the semaphore without sleeping,
-	 * if this does not work then we need to drop the
-	 * spinlock and do a hard attempt on the semaphore.
+	/*
+	 * Attempt to get the semaphore without sleeping first. If we fail
+	 * then do a blocking lock if requested.
 	 */
-	if (down_trylock(&bp->b_sema)) {
+	if (xfs_buf_cond_lock(bp)) {
 		if (!(flags & XBF_TRYLOCK)) {
 			/* wait for buffer ownership */
 			xfs_buf_lock(bp);
 			XFS_STATS_INC(xb_get_locked_waited);
 		} else {
-			/* We asked for a trylock and failed, no need
-			 * to look at file offset and length here, we
-			 * know that this buffer at least overlaps our
-			 * buffer and is locked, therefore our buffer
-			 * either does not exist, or is this buffer.
-			 */
+			/* We asked for a trylock and failed. */
 			xfs_buf_rele(bp);
 			XFS_STATS_INC(xb_busy_locked);
 			return NULL;
 		}
-	} else {
-		/* trylock worked */
-		XB_SET_OWNER(bp);
 	}
 
 	if (bp->b_flags & XBF_STALE) {
@@ -876,10 +868,18 @@ xfs_buf_rele(
  */
 
 /*
- * Locks a buffer object, if it is not already locked.
- * Note that this in no way locks the underlying pages, so it is only
- * useful for synchronizing concurrent use of buffer objects, not for
- * synchronizing independent access to the underlying pages.
+ * Locks a buffer object, if it is not already locked. Note that this in
+ * no way locks the underlying pages, so it is only useful for
+ * synchronizing concurrent use of buffer objects, not for synchronizing
+ * independent access to the underlying pages.
+ *
+ * If we come across a stale, pinned, locked buffer, we know that we are
+ * being asked to lock a buffer that has been reallocated. Because it is
+ * pinned, we know that the log has not been pushed to disk and hence it
+ * will still be locked. Rather than continuing to have trylock attempts
+ * fail until someone else pushes the log, push it ourselves before
+ * returning. This means that the xfsaild will not get stuck trying
+ * to push on stale inode buffers.
  */
 int
 xfs_buf_cond_lock(
@@ -890,6 +890,8 @@ xfs_buf_cond_lock(
 	locked = down_trylock(&bp->b_sema) == 0;
 	if (locked)
 		XB_SET_OWNER(bp);
+	else if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE))
+		xfs_log_force(bp->b_target->bt_mount, 0);
 
 	trace_xfs_buf_cond_lock(bp, _RET_IP_);
 	return locked ? 0 : -EBUSY;

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs