Re: [PATCH] xfs: don't free EFIs before the EFDs are committed

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 5 Apr 2013 11:45:40 +1100

On Thu, Apr 04, 2013 at 05:06:31PM -0500, Mark Tinguely wrote:
> On 04/02/13 22:09, Dave Chinner wrote:
> >From: Dave Chinner<dchinner@xxxxxxxxxx>
> >
> >Filesystems are occasionally being shut down with this error:
> >
> >xfs_trans_ail_delete_bulk: attempting to delete a log item that is
> >not in the AIL.
> >
> >It was diagnosed to be related to the EFI/EFD commit order when the
> >EFI and EFD are in different checkpoints and the EFD is committed
> >before the EFI here:
> >
> >http://oss.sgi.com/archives/xfs/2013-01/msg00082.html
> >
> >The real problem is that a single bit cannot fully describe the
> >states that the EFI/EFD processing can be in. These completion
> >states are:
> >
> >EFI			EFI in AIL	EFD		Result
> >committed/unpinned	Yes		committed	OK
> >committed/pinned	No		committed	Shutdown
> >uncommitted		No		committed	Shutdown
> >
> >
> >Note that the "result" field is what should happen, not what does
> >happen. The current logic is broken and handles the first two cases
> >correctly by luck.  That is, the code will free the EFI if the
> >XFS_EFI_COMMITTED bit is *not* set, rather than if it is set. The
> >inverted logic "works" because if both EFI and EFD are committed,
> >then the first __xfs_efi_release() call clears the XFS_EFI_COMMITTED
> >bit, and the second frees the EFI item. Hence as long as
> >xfs_efi_item_committed() has been called, everything appears to be
> >fine.
> >
> >It is the third case where the logic fails - where
> >xfs_efd_item_committed() is called before xfs_efi_item_committed(),
> >and that results in the EFI being freed before it has been
> >committed. That is the bug that triggered the shutdown, and hence
> >keeping track of whether the EFI has been committed or not is
> >insufficient to correctly order the EFI/EFD operations w.r.t. the
> >AIL.
> >
> >What we really want is this: the EFI is always placed into the
> >AIL before the last reference goes away. The only way to guarantee
> >that is that the EFI is not freed until after it has been unpinned
> >*and* the EFD has been committed. That is, restructure the logic so
> >that the only case that can occur is the first case.
> >
> >This can be done easily by replacing the XFS_EFI_COMMITTED with an
> >EFI reference count. The EFI is initialised with it's own count, and
> >that is not released until it is unpinned. However, there is a
> >complication to this method - the high level EFI/EFD code in
> >xfs_bmap_finish() does not hold direct references to the EFI
> >structure, and runs a transaction commit between the EFI and EFD
> >processing. Hence the EFI can be freed even before the EFD is
> >created using such a method.
> >
> >Further, log recovery uses the AIL for tracking EFI/EFDs that need
> >to be recovered, but it uses the AIL *differently* to the EFI
> >transaction commit. Hence log recovery never pins or unpins EFIs, so
> >we can't drop the EFI reference count indirectly to free the EFI.
> >
> >However, this doesn't prevent us from using a reference count here.
> >There is a 1:1 relationship between EFIs and EFDs, so when we
> >initialise the EFI we can take a reference count for the EFD as
> >well. This solves the xfs_bmap_finish() issue - the EFI will never
> >be freed until the EFD is processed. In terms of log recovery,
> >during the committing of the EFD we can look for the
> >XFS_EFI_RECOVERED bit being set and drop the EFI reference as well,
> >thereby ensuring everything works correctly there as well.
> 
> Duh me, beside what our problem where the cil push callbacks are
> called out of order, there can be several xfs_trans_committed_bulk()
> running at the same time.

I'm not sure that can actually happen - callback processing is
supposed to be serialised by xlog_state_do_callback()....

> Yep, the counter allows efi and efd to happen in any order and make
> sure the efi is entered into the AIL.
> 
> Okay, the pass2 of recovery places the efi item into the AIL, and
> the second decrement on the counter when the XFS_EFI_RECOVERED bit
> gets the counter to zero in the recovery case because there is no
> unpin.
> 
> Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>

Thank you.

> This patch also prevents the only chance of detecting of out of
> sequence ctx callbacks.

I don't think the callbacks are being attached out of order.  Like I
said previously: if that were happening then the commit records are
being written to the log in the wrong order and that is a *major*
issue.

As such, I've attached the informal proof that I just ran through to
verify that the current code is ordering commit records and hence
callbacks on iclogs correctly for multiple outstanding checkpoints.
I don't see any obvious hole in the logic as the callbacks are
attached to the iclog the commit record is written into and the
commit records are strictly serialised and ordered...

> The out of sequence ctx callbacks *does*
> matter because the li_lsn is set by the ctx->start_lsn and the
> lowest li_lsn is used to move the tail. When the callbacks are out
> of order, it is possible for the larger lsn of the greater ctx
> sequence to move the log tail beyond a lower ctx sequence entries
> before the entries for the lower sequence ctx are moved into the
> AIL.

Assuming we are getting callbacks running out of order, this is only
a problem if the AIL is empty at the time the out of order callbacks
are run. If the AIL is not empty, then the tail of the log is
derived directly from the LSN of the first item in the AIL and that
LSN will be from an item committed from a previous checkpoint, not
the current one that is runing out of order. Hence the tail of the
log is not affected in this case.

If the AIL is empty, then the tail LSN is determined by the lowest
LSN of the iclogs being processed concurrently. This code in
xlog_state_do_callback() is supposed to serialise and order the
completion of multiple concurrent iclog completions:

		lowest_lsn = xlog_get_lowest_lsn(log);
		if (lowest_lsn &&
		    XFS_LSN_CMP(lowest_lsn,
				be64_to_cpu(iclog->ic_header.h_lsn)) < 0) {
			iclog = iclog->ic_next;
			continue; /* Leave this iclog for
				   * another thread */
		}

IOWs, if the current iclog being checked for IO completion
processing has a higher LSN than the lowest in all the iclogs that
are not clean, then skip over it and check the next iclog. So, even
if we have multiple concurrent iclogs in
XLOG_STATE_DONE_SYNC/XLOG_STATE_DO_CALLBACK state, only the first
that comes across the iclog with the lowest LSN in these states is
supposed to run callbacks. And an iclog is considered in the
xlog_get_lowest_lsn() calculation until it becomes dirty/active,
which is a state transition that occurs *after* callbacks are run
on the iclog.

Further, once an iclog has been selected for callback processing, if
it has callbacks on it, we do this:

		if (iclog->ic_callback)
			atomic64_set(&log->l_last_sync_lsn,
				be64_to_cpu(iclog->ic_header.h_lsn));

to set the tail of the log to match that of the current LSN
of the iclog that is being processed. This is an "advance" update of
the log tail that will be used until items are inserted into
the AIL by callback processing (see xlog_assign_tail_lsn()). If
there are not items to be placed into the AIL (i.e. it remains
empty) then this ensures that the next log write still gets the tail
LSN correct.

And by this analysis, we should not be seeing concurrent execution
of callbacks on different iclogs, nor should we be seeing out of
order execution of callbacks on different iclogs. And because
callbacks are ordered correctly by the CIL push code we shouldn't be
seeing callbacks on a single iclog being run out of order
either.

IOWs, the tail LSN of the log should not be jumping around due to
out of order iclog and/or callback processing.  Unless there
is a problem in xlog_state_do_callback() with serialising and
ordering iclog IO completion processing, there shouldn't be a
problem with out of order callback processing, either.

Note that I haven't been through the iclog state machine in fine
detail since 2007, so I'm going to need to spend a bit of time
looking at it again to confirm this. I don't see any obvious problem
at the moment, though....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

commit CIL operation:

	down_write(ctx lock)
	increment sequence
	switch ctxs
	spin_lock(cil lock)
	add old ctx to commit list
	spin_unlock(cil lock)
	up_write(ctx lock)

	write checkpoint to log

	spin_lock(cil lock)
	walk commiting list and wait for prior sequences to commit
	spin_unlock(cil lock)

	xfs_log_done()		-> writes commit record, returns iclog, commit_lsn
	xfs_log_notify()	-> attaches callback to iclog tail

	spin_lock(cil lock)
	ctx->commit_lsn = commit_lsn
	wake waiters
	spin_unlock(cil lock)
	release iclog

	<iclog IO started>
	<iclog IO complete>
	<iclog run callbacks>

	spin_lock(cil lock)
	remove from committing list.
	spin_unlock(cil lock)

Informal proof of sequence commit order correctness by induction:

The initial context switching segment runs under an exclusive lock, and it means
that S2 can not be started (and hence committed) until S1 is already in the
committing list. So when S2 commits, if S1 has not yet written a commit record
we have the initial condition of:

	commiting list:
		S2, commit_lsn = 0
		S1, commit_lsn = 0

Now, if S2 writes it's checkpoint into the iclogs before S1 completes (e.g. S1
is a 10MB checkpoint, S2 is single inode core from a fsync) S2 enters the
committing list check and sees:

528 restart:
529         spin_lock(&cil->xc_cil_lock);
530         list_for_each_entry(new_ctx, &cil->xc_committing, committing) {

First entry is S2

531                 /*
532                  * Higher sequences will wait for this one so skip them.
533                  * Don't wait for own own sequence, either.
534                  */
535                 if (new_ctx->sequence >= ctx->sequence)
536                         continue;

new_ctx->sequence = S2, ctx->sequence = S2, so continue;

Second entry is S1, so we see:

new_ctx->sequence = S1, ctx->sequence = S2, so we  now check the commit lsn:

537                 if (!new_ctx->commit_lsn) {

new_ctx->commit_lsn = 0, so we:

538                         /*
539                          * It is still being pushed! Wait for the push to
540                          * complete, then start again from the beginning.
541                          */
542                         xlog_wait(&cil->xc_commit_wait, &cil->xc_cil_lock);
543                         goto restart;

wait and restart once woken.

544                 }
545         }
546         spin_unlock(&cil->xc_cil_lock);

Eventually, S1 finishes writing the checkpoint, and enters the commit record loop:

528 restart:
529         spin_lock(&cil->xc_cil_lock);
530         list_for_each_entry(new_ctx, &cil->xc_committing, committing) {

First entry is S2

531                 /*
532                  * Higher sequences will wait for this one so skip them.
533                  * Don't wait for own own sequence, either.
534                  */
535                 if (new_ctx->sequence >= ctx->sequence)
536                         continue;

new_ctx->sequence = S2, ctx->sequence = S1, so we continue.

Second entry is S1, so again we continue, hit the end of list and drop out of
the loop. At this point S1 writes the commit record, then attaches the callbacks
to the *tail* of the iclog callback list. In doing this, it gets the commit_lsn
of the commit record, so it does:

	ctx->commit_lsn = commit_lsn
	wakeup_all(&cil->xc_commit_wait);

Which wakes up S2. S2 now does:

_
528 restart:
529         spin_lock(&cil->xc_cil_lock);
530         list_for_each_entry(new_ctx, &cil->xc_committing, committing) {

First entry is S2

531                 /*
532                  * Higher sequences will wait for this one so skip them.
533                  * Don't wait for own own sequence, either.
534                  */
535                 if (new_ctx->sequence >= ctx->sequence)
536                         continue;

new_ctx->sequence = S2, ctx->sequence = S2, so we continue.

Second entry is S1, so we now check the commit lsn:

537                 if (!new_ctx->commit_lsn) {

new_ctx->commit_lsn != 0, so again we continue, hit the end of list and fall
through to the commit record write. This attaches the callback to the *tail* of
the iclog callback list, so it ordered *after* the callbacks for S1.

Hence the commit records appear to be correctly ordered by this code, as are the
callbacks being attached to iclogs. So, for the case of S1, S2 the code is safe.

For the case of S1, S2, S3, we have the same initial conditions:

	- S2 is not made active until S1 is on the commit list
	- S3 is not made active until S2 is on the commit list

Hence if we have S1 still committing while S2 and S3 have completed their
commits, the wait cycle looks like:

S2:
	Sees S2, continue;
	Sees S1, sees commit_lsn == 0, waits
S3:
	Sees S3, continue;
	Sees S2, sees commit_lsn == 0, waits

S1: completes commit,
	Sees S3, continue;
	Sees S2, continue;
	Sees S1, continue;
	Writes commit record
	Attaches callbacks
	sets commit_lsn
	wakeup_all()

S2:						S3:
	wake, restart				wake, restart
						[win wakeup race on cil lock]
						Sees S3, continue
						Sees S2, sees commit_lsn == 0, wait

	Sees S3, continue
	Sees S2, continue
	Sees S1, sees commit_lsn != 0, continue
						[lost wakeup race, spins on cil lock]
						[not woken until S2 already through]
						Sees S3, continue
						Sees S2, sees commit_lsn == 0, wait

	Writes commit record
	Attaches callbacks
	sets commit_lsn
	wakeup_all()

S3:
	Sees S3, continue
	Sees S2, sees commit_lsn != 0, continue
	Sees S1, sees commit_lsn != 0, continue
	Writes commit record
	Attaches callbacks
	sets commit_lsn
	wakeup_all()

And so the ordering of commit records and multiple callbacks on the same iclog
are correct for S1, S2 and S3.

So, for and arbitrary set of sequences {S1...Sn, S(n+1)}, the initial conditions are:

	- S2 is not made active until S1 is on the commit list
	....
	- S(n+1) is not made active until Sn is on the commit list

Hence if we have S1 still committing while Sn and S(n+1) have completed their
commits, the wait cycle looks like:

Sn:
	Sees Sn, continue;
	.....
	Sees S1, sees commit_lsn == 0, waits
S(n+1_:
	Sees S(n+1), continue;
	Sees Sn, sees commit_lsn == 0, waits

S1: completes commit,
	Sees S(n+1), continue;
	Sees Sn, continue;
	.....
	Sees S1, continue;
	Writes commit record
	Attaches callbacks
	sets commit_lsn
	wakeup_all()

Sn:						S(n+1):
	wake, restart				wake, restart
						[win wakeup race on cil lock]
						Sees S(n+1), continue
						Sees Sn, sees commit_lsn == 0, wait

	Sees S(n+1), continue
	Sees Sn, continue
	.....
	Sees S1, sees commit_lsn != 0, continue
						[lost wakeup race, spins on cil lock]
						[not woken until Sn already through]
						Sees S(n+1), continue
						Sees Sn, sees commit_lsn == 0, wait

	Writes commit record
	Attaches callbacks
	sets commit_lsn
	wakeup_all()

S3:
	Sees S(n+1), continue
	Sees Sn, sees commit_lsn != 0, continue
	.....
	Sees S1, sees commit_lsn != 0, continue
	Writes commit record
	Attaches callbacks
	sets commit_lsn
	wakeup_all()

Hence by induction the ordering of commit records and multiple callbacks on the
same iclog are correct for {S1...Sn, S(n+1)}

QED

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs