Re: [PATCH] xfs: prevent spurious "head behind tail" warnings

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 21 Nov 2013 12:42:55 +1100

On Tue, Nov 19, 2013 at 05:08:33PM -0600, Mark Tinguely wrote:
> On 11/19/13 16:37, Dave Chinner wrote:
> >From: Dave Chinner<dchinner@xxxxxxxxxx>
> >
> >When xlog_space_left() cracks the grant head and the log tail, it
> >does so without locking to synchronise the sampling of the
> >variables. It samples the grant head first, so if there is a delay
> >before it smaples the log tail, there is a window where the log tail
> >could have moved onwards and be moved past the sampled value of the
> >grant head. This then leads to the "xlog_space_left: head behind
> >tail" warning message.
> >
> >To avoid spurious output in this situation, swap the order in which
> >the variables are cracked. This means that the head may grant head
> >may move if there is a delay, but the log tail will be stable, hence
> >ensure the tail does not jump the head accidentally.
> >
> >While this avoids the spurious head behind tail problem, it
> >introduces the opposite problem - the head can move more than a full
> >cycle past the tail. The code already handles this case by
> >indicating that the log is full (i.e. zero space available) but
> >that's still (generally) a spurious situation.
> >
> >Hence, if we detect that the head is more than a cycle ahead of the
> >tail or the head is behind the tail, start the calculation again by
> >resampling the variables and trying again. If we get too many
> >resamples, then throw a warning and return a full or empty log
> >appropriately.
> >
> >Signed-off-by: Dave Chinner<dchinner@xxxxxxxxxx>
> >---
> 
> I am still getting the debug message:
> 
>   xlog_verify_grant_tail: space > BBTOB(tail_blocks)

I'm not sure why that is relevant to this patch - it doesn't touch
xlog_verify_grant_tail(), nor does it change the functioning of the
code except to remove false positive warnings.

Further, the comment about xlog_verify_grant_tail() says:

  * This check is run unlocked, so can give false positives. 

That's exactly what this patch addresses for xlog_space_left() -
it prevents false positives when it is run unlocked. So we could do
exactly the same thing to xlog_verify_grant_tail() as well. Patch
attached below.

> This is a real over grant.

Evidence, please.

> It has been a while since I did all the
> tests, but basically the only way to stop it is to have a lock
> between checking for xlog_space_left() and actually reserving the
> space.

Yes, so you demonstrated, but that change also completely destroyed
scalability of the transaction subsystem in the process, hence is a
non-starter, especially just to silence a debug-only false positive.
IOWs, there was no evidence provided that the message was the result
of an actual leak (i.e. a problem in the algorithm) rather than it
being a false positive.

The first step in determining if we really have a problem here is to
cut the false positive rate, and that's exactly what I'm addressing
here. Hence when we see this warning being emitted we will have a
much higher confidence that it is a real problem and not a false
positive.

> I am not a fan of another band-aid on a problem that is caused
> because we are granting space without locks.

Lockless algorithms *by their very nature* involve verifying what
was grabbed without a lock is valid before using it. It's a
fundamental principle underlying lockless algorithms, whether they
be lockless due to use of cmpxchg (like the log grant code) or RCU
(like the xfs inode cache) or just plain old memory barriers to
enforce memory access ordering.

Just because the original code didn't do this doesn't mean doing it
now is a band-aid. I understand lockless algorithms a whole lot
better than I did 3 years ago when I wrote the original algorithm,
and as such can see things I didn't clearly understand back then.
Indeed, the comments indicate that I knew there were problems (i.e.
false positive detection issues), but didn't understand them well
enough to be able to mitigate them. I certainly understand them well
enough now, so I'd hardly call mitigating known issues like this a
band-aid.

Looking at the bigger picture, we're adding lockless algorithms
rapidly throughout the kernel.  Hence it's critical that kernel
developers understand how lockless algorithms work and realise that
locks are no longer the preferred solution to the scalability
problem. If a lockless algorithm is broken, then we need to
understand exactly how it is broken, determine whether it can be
fixed (e.g. is it just missing a critical memory barrier?), and as
a last resort, if it needs to be replaced then we need to design
a new algorithm that scales equivalently well.

We need to start by killing as many false positives the lockless
algorithm throws out and see what remains. We have no real evidence
that there are fundamental problems with the lockless algorithm, so
let's address the issues we know about first and go from there.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

xfs: prevent spurious "space > BBTOB(tail_blocks)" warnings

When xlog_verify_grant_tail() cracks the grant head and the log
tail, it does so without locking to synchronise the sampling of the
variables. It samples the grant head first, so if there is a delay
before it samples the log tail, there is a window where the log tail
could have moved onwards and be moved past the sampled value of the
grant head. This then leads to the "xlog_verify_grant_tail: space >
BBTOB(tail_blocks)" warning message.

To avoid spurious output in this situation, swap the order in which
the variables are cracked. This means that the head may grant head
may move if there is a delay, but the log tail will be stable, hence
ensure the tail does not jump the head accidentally.

While this avoids the spurious head behind tail problem, it
introduces the opposite problem - the head can move more than a full
cycle past the tail.

Hence, if we detect that the head is more than a cycle ahead of the
tail or the head is behind the tail, start the calculation again by
resampling the variables and trying again. If we get too many
resamples, then throw a warning.

Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>

---
 fs/xfs/xfs_log.c | 82 +++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 64 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 0cfcc20..6857e09 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -3682,26 +3682,72 @@ STATIC void
 xlog_verify_grant_tail(
 	struct xlog	*log)
 {
-	int		tail_cycle, tail_blocks;
-	int		cycle, space;
-
-	xlog_crack_grant_head(&log->l_write_head.grant, &cycle, &space);
-	xlog_crack_atomic_lsn(&log->l_tail_lsn, &tail_cycle, &tail_blocks);
-	if (tail_cycle != cycle) {
-		if (cycle - 1 != tail_cycle &&
-		    !(log->l_flags & XLOG_TAIL_WARN)) {
-			xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
-				"%s: cycle - 1 != tail_cycle", __func__);
-			log->l_flags |= XLOG_TAIL_WARN;
-		}
+	int		tail_cycle;
+	int		tail_blocks;
+	int		head_cycle;
+	int		head_bytes;
+	int		retries = 0;
+#define XLOG_VERIFY_MAX_RETRIES	5
 
-		if (space > BBTOB(tail_blocks) &&
-		    !(log->l_flags & XLOG_TAIL_WARN)) {
-			xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
-				"%s: space > BBTOB(tail_blocks)", __func__);
-			log->l_flags |= XLOG_TAIL_WARN;
-		}
+
+	/*
+	 * If we've already detected an problem here, don't bother checking
+	 * again.
+	 */
+	if (log->l_flags & XLOG_TAIL_WARN)
+		return;
+	do {
+		/*
+		 * sample tail before head to avoid spurious warnings due to
+		 * racing tail updates. We dump a memory barrier here to make
+		 * sure we pick up the latest values that have been written by
+		 * other threads each time through.
+		 */
+		smp_mb();
+		xlog_crack_atomic_lsn(&log->l_tail_lsn, &tail_cycle,
+				      &tail_blocks);
+		xlog_crack_grant_head(&log->l_write_head.grant, &head_cycle,
+				      &head_bytes);
+
+		/*
+		 * if the cycles are the same, the head and tail can't be
+		 * overlapping, so everything is ok and we are done.
+		 */
+		if (tail_cycle == head_cycle)
+			return;
+
+		/*
+		 * if the tail is on the previous cycle to the head and the head
+		 * is before the tail, then all is good.
+		 */
+		if (tail_cycle == head_cycle - 1 &&
+		    BBTOB(tail_blocks) >= head_bytes)
+			return;
+
+		/*
+		 * Cycles don't match or the head overlapped the tail.
+		 * Invalid, so let's go around again and resample to ensure this
+		 * isn't a false positive.
+		 */
+	} while (retries++ < XLOG_VERIFY_MAX_RETRIES);
+
+	/*
+	 * OK, we're in trouble now - the head and tail are out of sync. Time to
+	 * issue a warning about it
+	 */
+	if (head_cycle - 1 != tail_cycle) {
+		xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
+			"%s: head cycle - 1 != tail_cycle (0x%x/0x%x)",
+			      __func__, head_cycle, tail_cycle);
+	}
+
+	if (head_bytes > BBTOB(tail_blocks)) {
+		xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
+			"%s: head_bytes > BBTOB(tail_blocks) (0x%x/0x%x)",
+			      __func__, head_bytes, BBTOB(tail_blocks));
 	}
+
+	log->l_flags |= XLOG_TAIL_WARN;
 }
 
 /* check if it will fit */

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs