Re: [PATCH] xfs: xlog_cil_force_lsn doesn't always wait correctly

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 10 Sep 2014 08:04:00 +1000

On Tue, Sep 09, 2014 at 09:44:58AM -0400, Brian Foster wrote:
> On Tue, Sep 09, 2014 at 11:43:15AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > 
> > When running a tight mount/unmount loop on an older kernel, RedHat
> > QE found that unmount would occasionally hang in
> > xfs_buf_unpin_wait() on the superblock buffer. Tracing and other
> > debug work by Eric Sandeen indicated that it was hanging on the
> > writing of the superblock during unmount immediately after logging
> > the superblock counters in a synchronous transaction. Further debug
> > indicated that the synchronous transaction was not waiting for
> > completion correctly, and we narrowed it down to
> > xlog_cil_force_lsn() returning NULLCOMMITLSN and hence not pushing
> > the transaction in the iclog buffer to disk correctly.
> > 
> > While this unmount superblock write code is now very different in
> > mainline kernels, the xlog_cil_force_lsn() code is identical, and it
> > was bisected to the backport of commit f876e44 ("xfs: always do log
> > forces via the workqueue"). This commit made the CIL push
> > asynchronous for log forces and hence exposed a race condition that
> > couldn't occur on a synchronous push.
> > 
> > Essentially, the xlog_cil_force_lsn() relied implicitly on the fact
> > that the sequence push would be complete by the time
> > xlog_cil_push_now() returned, resulting in the context being pushed
> > being in the committing list. When it was made asynchronous, it was
> > recognised that there was a race condition in detecting whether an
> > asynchronous push has started or not and code was added to handle
> > it.
> > 
> > Unfortunately, the fix was not quite right and left a race condition
> > where it it would detect an empty CIL while a push was in progress
> > before the context had been added to the committing list. This was
> > incorrectly seen as a "nothing to do" condition and so would tell
> > xfs_log_force_lsn() that there is nothing to wait for, and hence it
> > would push the iclogbufs in memory.
> > 
> > The fix is simple, but explaining the logic and the race condition
> > is a lot more complex. The fix is to add the context to the
> > committing list before we start emptying the CIL. This allows us to
> > detect the difference between an empty "do nothing" push and a push
> > that has not started by adding a discrete "emptying the CIL" state
> > to avoid the transient, incorrect "empty" condition that the
> > (unchanged) waiting code was seeing.
> > 
> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > Tested-by: Eric Sandeen <esandeen@xxxxxxxxxx>
> > ---
> 
> The pusher side queues the work, acquires xc_push_lock, walks the
> committing list and then does this:
> 
> 	if (sequence == cil->xc_current_sequence &&
> 	    !list_empty(&cil->xc_cil)) {
> 		spin_unlock(&cil->xc_push_lock);
> 		goto restart;
> 	}
> 
> ... which is effectively peeking into the cil state to see if the push
> actually needed to do anything. The purpose is to identify the case
> where the push hasn't occurred and the previous committing list
> traversal was insufficient.
> 
> The "pushee" side drains the cil, updates the current sequence and adds
> the ctx to the committing list. The time between the cil drain and
> list_add() creates a window where, according to the logic above, the cil
> push looks like it wasn't going to do anything from the pusher
> perspective. The latter two bits occur under xc_push_lock (spinlock), so
> perhaps that widens a window to make this slightly more reproducible.
> 
> The pushee work is all done under xc_ctx_lock, so generally this is an
> issue because the pusher doesn't acquire xc_ctx_lock, and thus the cil
> drain and ctx update are not seen as atomic. I presume this is lockless
> by design, so taking the lock is not a desired solution.
> 
> Therefore, the order of changes to data structures that the pusher peeks
> into is updated on the pushee side to add to the committing list, drain
> the cil and update the current sequence number.
> 
> For the pusher side, this means the committing list walk and logic above
> (all under xc_push_lock) either sees the ctx on the list or not. If the
> ctx is on the list, then we waited for it and there is no race. If the
> ctx is not on the list, it is either populated and waiting to be added
> to the list (e.g., we hold xc_push_lock) or empty and there's nothing to
> do (regardless of whether the pusher side has run or not).
> 
> That seems sound to me. It would be nice of we could somehow make this
> explicitly more simple rather than dependent on complex ordering (e.g.,
> a push completion or something of that sort), but that's for another
> time. One nit below...

Yeah, it's complex and order dependent. I think I broke Eric trying
to explain it to him when debugging it. ;)

But the complexity is a side effect of the push lock being a
critically hot lock and so anything we can do to avoid needing to
hold it for longer or take it more often is worth the complexity. If
you can think of a way that is simpler and doesn't introduce new
contention points between pushers, forces and commits then I'm all
ears ;)

....

> Looks like out_skip could unlock xc_push_lock now.

Probably can be, but I'll leave it as is right now.

> Reviewed-by: Brian Foster <bfoster@xxxxxxxxxx>

Thanks!

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs