Re: [PATCH 2/8] xfs: separate CIL commit record IO

On Wed, Feb 24, 2021 at 12:34:29PM -0800, Darrick J. Wong wrote:
> On Tue, Feb 23, 2021 at 02:34:36PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > 
> > To allow the iclog IO device cache flush behaviour to be optimised,
> > we first need to separate out the commit record iclog IO from the
> > rest of the checkpoint so we can wait for the checkpoint IO to
> > complete before we issue the commit record.
> > 
> > This separation is only necessary if the commit record is being
> > written into a different iclog from the start of the checkpoint, as
> > the upcoming cache flushing changes require completion ordering
> > against the other iclogs submitted by the checkpoint.
> > 
> > If the entire checkpoint and commit record are in the one iclog,
> > then they are both covered by the one set of cache flush primitives
> > on the iclog and hence there is no need to separate them for
> > ordering.
> > 
> > Otherwise, we need to wait for all the previous iclogs to complete
> > so they are ordered correctly and made stable by the REQ_PREFLUSH
> > that the commit record iclog IO issues. This guarantees that if a
> > reader sees the commit record in the journal, they will also see the
> > entire checkpoint that commit record closes off.
> > 
> > This also provides the guarantee that when the commit record IO
> > completes, we can safely unpin all the log items in the checkpoint
> > so they can be written back because the entire checkpoint is stable
> > in the journal.
> > 
> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > ---
> >  fs/xfs/xfs_log.c      | 55 +++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_log_cil.c  |  7 ++++++
> >  fs/xfs/xfs_log_priv.h |  2 ++
> >  3 files changed, 64 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index fa284f26d10e..ff26fb46d70f 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -808,6 +808,61 @@ xlog_wait_on_iclog(
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Wait on any iclogs that are still flushing in the range of start_lsn to the
> > + * current iclog's lsn. The caller holds a reference to the iclog, but otherwise
> > + * holds no log locks.
> > + *
> > + * We walk backwards through the iclogs to find the iclog with the highest lsn
> > + * in the range that we need to wait for and then wait for it to complete.
> > + * Completion ordering of iclog IOs ensures that all iclogs prior to the
> > + * candidate iclog we need to sleep on have completed by the time our
> > + * candidate has completed its IO.
> 
> Hmm, I guess this means that iclog header lsns are supposed to increase
> as one walks forwards through the list?

Yes, the iclogs are written sequentially to the log - we don't
switch the log->l_iclog pointer to the next iclog until we switch
the current one out, and then the next iclog in the ring is
physically located at a higher lsn than the one we just switched
out.
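
Roughly, the walk looks like this (an illustrative sketch, not the
literal hunk - the rest of the new function is trimmed from the quote
above, so treat the function name, state checks and locking details
as approximations):

static void
xlog_wait_on_prior_iclogs_sketch(
	struct xlog_in_core	*iclog,
	xfs_lsn_t		start_lsn)
{
	struct xlog		*log = iclog->ic_log;
	struct xlog_in_core	*prev;

	spin_lock(&log->l_icloglock);
	for (prev = iclog->ic_prev; prev != iclog; prev = prev->ic_prev) {
		xfs_lsn_t	prev_lsn = be64_to_cpu(prev->ic_header.h_lsn);

		/*
		 * Clean, newly activated iclogs have h_lsn == 0, so
		 * this also terminates the walk once everything in the
		 * range has completed.
		 */
		if (XFS_LSN_CMP(prev_lsn, start_lsn) < 0)
			break;

		/*
		 * IO has already completed on this iclog; keep walking
		 * backwards for an older one that may still be in
		 * flight.
		 */
		if (prev->ic_state == XLOG_STATE_ACTIVE ||
		    prev->ic_state == XLOG_STATE_DIRTY)
			continue;

		/*
		 * Highest-lsn iclog still under IO within the range.
		 * Completion ordering guarantees all earlier iclogs are
		 * done by the time this one completes. xlog_wait()
		 * drops l_icloglock for us.
		 */
		xlog_wait(&prev->ic_force_wait, &log->l_icloglock);
		return;
	}
	spin_unlock(&log->l_icloglock);
}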

> > + *
> > + * Therefore we only need to find the first iclog that isn't clean within the
> > + * span of our flush range. If we come across a clean, newly activated iclog
> > + * with an lsn of 0, it means IO has completed on this iclog and all previous
> > + * iclogs will have been completed prior to this one. Hence finding a newly
> > + * activated iclog indicates that there are no iclogs in the range we need to
> > + * wait on and we are done searching.
> 
> I don't see an explicit check for an iclog with a zero lsn?  Is that
> implied by XLOG_STATE_ACTIVE?

It's handled by the XFS_LSN_CMP(prev_lsn, start_lsn) < 0 check. If
the prev_lsn is zero because the iclog is clean, then this check
will always be true.
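
For reference, the comparison helper orders LSNs by cycle first and
then block (this is from xfs_log_priv.h as I remember it - check your
tree), which is why a zeroed header lsn always sorts below a live
start_lsn:

static inline xfs_lsn_t
_lsn_cmp(
	xfs_lsn_t	lsn1,
	xfs_lsn_t	lsn2)
{
	if (CYCLE_LSN(lsn1) != CYCLE_LSN(lsn2))
		return (CYCLE_LSN(lsn1) < CYCLE_LSN(lsn2)) ? -999 : 999;
	if (BLOCK_LSN(lsn1) != BLOCK_LSN(lsn2))
		return (BLOCK_LSN(lsn1) < BLOCK_LSN(lsn2)) ? -999 : 999;
	return 0;
}

A clean iclog has h_lsn == 0, i.e. cycle 0, and log cycles start at
1, so the cycle comparison alone makes the check true for any valid
start_lsn.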

> Also, do you have any idea what Christoph was talking about wrt devices
> with no-op flushes the last time this patch was posted?  This change
> seems straightforward to me (assuming the answers to my two questions are
> 'yes') but I didn't grok what subtlety he was alluding to...?

He was wondering what devices benefited from this. It has no impact
on high-speed devices that do not require flushes/FUA (e.g. high-end
Intel Optane SSDs), but those are not the devices this change is
aimed at. There are no regressions on these high-end devices,
either, so they are largely irrelevant to the patch and what it
targets...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


