Re: [PATCH 4/7] xfs: async CIL flushes need pending pushes to be made stable

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 16 Mar 2022 08:47:45 +1100

On Tue, Mar 15, 2022 at 12:36:24PM -0700, Darrick J. Wong wrote:
> On Tue, Mar 15, 2022 at 05:42:38PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > 
> > When the AIL tries to flush the CIL, it relies on the CIL push
> > ending up on stable storage without having to wait for and
> > manipulate iclog state directly. However, if there is already a
> > pending CIL push when the AIL tries to flush the CIL, it won't set
> > the cil->xc_push_commit_stable flag and so the CIL push will not
> > actively flush the commit record iclog.
> > 
> > generic/530 when run on a single CPU test VM can trigger this fairly
> > reliably. This test exercises unlinked inode recovery, and can
> > result in inodes being pinned in memory by ongoing modifications to
> > the inode cluster buffer to record unlinked list modifications. As a
> > result, the first inode unlinked in a buffer can pin the tail of the
> > log whilst the inode cluster buffer is pinned by the current
> > checkpoint that has been pushed but isn't on stable storage because
> > because the cil->xc_push_commit_stable was not set. This results in
> > the log/AIL effectively deadlocking until something triggers the
> > commit record iclog to be pushed to stable storage (i.e. the
> > periodic log worker calling xfs_log_force()).
> > 
> > The fix is two-fold - first we should always set the
> > cil->xc_push_commit_stable when xlog_cil_flush() is called,
> > regardless of whether there is already a pending push or not.
> > 
> > Second, if the CIL is empty, we should trigger an iclog flush to
> > ensure that the iclogs of the last checkpoint have actually been
> > submitted to disk as that checkpoint may not have been run under
> > stable completion constraints.
> 
> Can it ever be the case that the CIL is not empty but the last
> checkpoint wasn't committed to disk?

Yes. But xlog_cil_push_now() will capture that, queue it and mark
it as xc_push_commit_stable. 

Remember that the push_now() code updates the push seq/stable
state under down_read(ctx lock) + spin_lock(push lock) context. The
push seq/stable state is cleared by the push worker under
down_write(ctx lock) + spin_lock(push lock) conditions when it
atomically swaps in the new empty CIL context. Hence the push worker
will always see the stable flag if it has been set for that push
sequence.

> Let's say someone else
> commits a transaction after the worker samples xc_push_commit_stable?

If we race with commits between the xlog_cil_push_now(log, seq,
true) and the CIL list_empty check in xlog_cil_flush(), there are
two posibilities:

1. the CIL push worker hasn't run and atomically switched in a new
CIL context.
2. the CIL push worker has run and switched contexts

In the first case, the commit will end up in the same context that
xlog_cil_flush() pushed, and it will be stable. That will result in
an empty CIL after the CIL push worker runs, but the racing commit
will be stable as per the xc_push_commit_stable flag. This can also
lead to the CIL being empty by the time the list_empty check is done
(because pre-empt), in which case the log force will be a no-op
because none of the iclogs need flushing.

> IOWs, why does a not-empty CIL mean that the last checkpoint is on disk?

In the second case, the CIL push triggered by xlog_cil_push_now()
will be stable because xc_push_commit_stable says it must be, and
the racing commit will end up in the new CIL context and the CIL
won't be empty. We don't need a log force in this case because the
previous sequence that was flushed with stable semantics as required.

In the case of AIL pushing, we don't actually care about racing CIL
commits because we are trying to get pinned AIL items unpinned so we
can move the tail of the log forwards. If those pinned items are
relogged by racing transactions, then the next call to
xlog_cil_flush() from the AIL will get them unpinned and that will
move them forward in the log, anyway.

Cheers,

Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx