Re: [PATCH 2/5] xfs: separate CIL commit record IO

Brian Foster <bfoster@xxxxxxxxxx> · Thu, 28 Jan 2021 10:07:41 -0500

FYI, I finally got to reading your prior log cache flushing thread
yesterday afternoon. I was planning to revisit that and probably reply
this morning after having some time to digest, but saw this and so will
reply here..

On Thu, Jan 28, 2021 at 03:41:51PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> To allow for iclog IO device cache flush behaviour to be optimised,
> we first need to separate out the commit record iclog IO from the
> rest of the checkpoint so we can wait for the checkpoint IO to
> complete before we issue the commit record.
> 
> This separate is only necessary if the commit record is being
> written into a different iclog to the start of the checkpoint. If
> the entire checkpoint and commit is in the one iclog, then they are
> both covered by the one set of cache flush primitives on the iclog
> and hence there is no need to separate them.
> 

I find the description here a bit vague.. we have to separate out the
commit record I/O, but only if it's already separate..? Glancing at the
code, I don't see any new "separation" happening, so it's not clear to
me what that really refers to. This looks more like we're introducing
serialization to provide checkpoint -> commit record ordering (when the
commit record happens to be in a separate iclog).

> Otherwise, we need to wait for all the previous iclogs to complete
> so they are ordered correctly and made stable by the REQ_PREFLUSH
> that the commit record iclog IO issues. This guarantees that if a
> reader sees the commit record in the journal, they will also see the
> entire checkpoint that commit record closes off.
> 
> This also provides the guarantee that when the commit record IO
> completes, we can safely unpin all the log items in the checkpoint
> so they can be written back because the entire checkpoint is stable
> in the journal.
> 
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  fs/xfs/xfs_log.c      | 47 +++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_log_cil.c  |  7 +++++++
>  fs/xfs/xfs_log_priv.h |  2 ++
>  3 files changed, 56 insertions(+)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index c5f507c24577..c5e3da23961c 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -808,6 +808,53 @@ xlog_wait_on_iclog(
>  	return 0;
>  }
>  
> +/*
> + * Wait on any iclogs that are still flushing in the range of start_lsn to
> + * the current iclog's lsn. The caller holds a reference to the iclog, but
> + * otherwise holds no log locks.
> + *
> + * We walk backwards through the iclogs to find the iclog with the highest lsn
> + * in the range that we need to wait for and then wait for it to complete.
> + * Completion ordering of iclog IOs ensures that all prior iclogs to this IO are
> + * complete by the time our candidate has completed.
> + */
> +int
> +xlog_wait_on_iclog_lsn(
> +	struct xlog_in_core	*iclog,
> +	xfs_lsn_t		start_lsn)
> +{
> +	struct xlog		*log = iclog->ic_log;
> +	struct xlog_in_core	*prev;
> +	int			error = -EIO;
> +
> +	spin_lock(&log->l_icloglock);
> +	if (XLOG_FORCED_SHUTDOWN(log))
> +		goto out_unlock;
> +
> +	error = 0;
> +	for (prev = iclog->ic_prev; prev != iclog; prev = prev->ic_prev) {
> +
> +		/* Done if the lsn is before our start lsn */
> +		if (XFS_LSN_CMP(be64_to_cpu(prev->ic_header.h_lsn),
> +				start_lsn) < 0)
> +			break;

Hmm.. that logic looks a bit dodgy when you consider that the iclog
header lsn is reset to zero on activation. I think it actually works as
intended because iclog completion reactivates iclogs in LSN order and
this loop walks in reverse order, but that is a very subtle connection
that might be useful to document.

> +
> +		/* Don't need to wait on completed, clean iclogs */
> +		if (prev->ic_state == XLOG_STATE_DIRTY ||
> +		    prev->ic_state == XLOG_STATE_ACTIVE) {
> +			continue;
> +		}
> +
> +		/* wait for completion on this iclog */
> +		xlog_wait(&prev->ic_force_wait, &log->l_icloglock);
> +		return 0;
> +	}
> +
> +out_unlock:
> +	spin_unlock(&log->l_icloglock);
> +	return error;
> +}
> +
>  /*
>   * Write out an unmount record using the ticket provided. We have to account for
>   * the data space used in the unmount ticket as this write is not done from a
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b0ef071b3cb5..c5cc1b7ad25e 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -870,6 +870,13 @@ xlog_cil_push_work(
>  	wake_up_all(&cil->xc_commit_wait);
>  	spin_unlock(&cil->xc_push_lock);
>  
> +	/*
> +	 * If the checkpoint spans multiple iclogs, wait for all previous
> +	 * iclogs to complete before we submit the commit_iclog.
> +	 */
> +	if (ctx->start_lsn != commit_lsn)
> +		xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn);
> +

This is an interesting serialization point because we don't necessarily
submit the iclog that holds the commit record. I actually think in most
cases the separate commit record won't land on disk until the next CIL
push causes a sync of the current head iclog (assuming it fills it up).
Granted, this is the last point we have context of the commit record
being written, so it makes sense from an opportunistic standpoint to
serialize here (just as we already block to ensure commit record
ordering across checkpoints). That said, with the aggressive batching
ability of the CIL I think blocking on prior pushes is potentially less
heavy handed than unconditionally blocking on all prior iclog I/O. I
wonder if this might be something to address if we're going to compound
this path with more serialization.

>From a performance perspective, it seems like this makes the CIL push
_nearly_ synchronous by default. I.e., if we have a several hundred MB
CIL context, we're going to wait for all but the final iclog to complete
before we return. Of course we'll be waiting for much of that anyways to
reuse limited iclog space, but I have to think more about what that
means in general (maybe not much) and get through the subsequent
patches. In the meantime, have you put thought into any potential
negative performance impact from this that might or might not be offset
by subsequent flush optimizations?

Brian

>  	/* release the hounds! */
>  	xfs_log_release_iclog(commit_iclog);
>  	return;
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 037950cf1061..a7ac85aaff4e 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -584,6 +584,8 @@ xlog_wait(
>  	remove_wait_queue(wq, &wait);
>  }
>  
> +int xlog_wait_on_iclog_lsn(struct xlog_in_core *iclog, xfs_lsn_t start_lsn);
> +
>  /*
>   * The LSN is valid so long as it is behind the current LSN. If it isn't, this
>   * means that the next log record that includes this metadata could have a
> -- 
> 2.28.0
>