On 23 Feb 2021 at 09:04, Dave Chinner wrote: > From: Dave Chinner <dchinner@xxxxxxxxxx> > > Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to > guarantee the ordering requirements the journal has w.r.t. metadata > writeback. THe two ordering constraints are: > > 1. we cannot overwrite metadata in the journal until we guarantee > that the dirty metadata has been written back in place and is > stable. > > 2. we cannot write back dirty metadata until it has been written to > the journal and guaranteed to be stable (and hence recoverable) in > the journal. > > These rules apply to the atomic transactions recorded in the > journal, not to the journal IO itself. Hence we need to ensure > metadata is stable before we start writing a new transaction to the > journal (guarantee #1), and we need to ensure the entire transaction > is stable in the journal before we start metadata writeback > (guarantee #2). > > The ordering guarantees of #1 are currently provided by REQ_PREFLUSH > being added to every iclog IO. This causes the journal IO to issue a > cache flush and wait for it to complete before issuing the write IO > to the journal. Hence all completed metadata IO is guaranteed to be > stable before the journal overwrites the old metadata. > > However, for long running CIL checkpoints that might do a thousand > journal IOs, we don't need every single one of these iclog IOs to > issue a cache flush - the cache flush done before the first iclog is > submitted is sufficient to cover the entire range in the log that > the checkpoint will overwrite because the CIL space reservation > guarantees the tail of the log (completed metadata) is already > beyond the range of the checkpoint write. > > Hence we only need a full cache flush between closing off the CIL > checkpoint context (i.e. when the push switches it out) and issuing > the first journal IO. Rather than plumbing this through to the > journal IO, we can start this cache flush the moment the CIL context > is owned exclusively by the push worker. The cache flush can be in > progress while we process the CIL ready for writing, hence > reducing the latency of the initial iclog write. This is especially > true for large checkpoints, where we might have to process hundreds > of thousands of log vectors before we issue the first iclog write. > In these cases, it is likely the cache flush has already been > completed by the time we have built the CIL log vector chain. > Indeed, a single cache flush of the "data device" that is issued before writing the first iclog of a CIL context is sufficient to make sure that the metadata has really reached non-volatile storage. Reviewed-by: Chandan Babu R <chandanrlinux@xxxxxxxxx> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > --- > fs/xfs/xfs_log_cil.c | 29 +++++++++++++++++++++++++---- > 1 file changed, 25 insertions(+), 4 deletions(-) > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > index c5cc1b7ad25e..8bcacd463f06 100644 > --- a/fs/xfs/xfs_log_cil.c > +++ b/fs/xfs/xfs_log_cil.c > @@ -656,6 +656,7 @@ xlog_cil_push_work( > struct xfs_log_vec lvhdr = { NULL }; > xfs_lsn_t commit_lsn; > xfs_lsn_t push_seq; > + DECLARE_COMPLETION_ONSTACK(bdev_flush); > > new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS); > new_ctx->ticket = xlog_cil_ticket_alloc(log); > @@ -719,10 +720,24 @@ xlog_cil_push_work( > spin_unlock(&cil->xc_push_lock); > > /* > - * pull all the log vectors off the items in the CIL, and > - * remove the items from the CIL. We don't need the CIL lock > - * here because it's only needed on the transaction commit > - * side which is currently locked out by the flush lock. > + * The CIL is stable at this point - nothing new will be added to it > + * because we hold the flush lock exclusively. Hence we can now issue > + * a cache flush to ensure all the completed metadata in the journal we > + * are about to overwrite is on stable storage. > + * > + * This avoids the need to have the iclogs issue REQ_PREFLUSH based > + * cache flushes to provide this ordering guarantee, and hence for CIL > + * checkpoints that require hundreds or thousands of log writes no > + * longer need to issue device cache flushes to provide metadata > + * writeback ordering. > + */ > + xfs_flush_bdev_async(log->l_mp->m_ddev_targp->bt_bdev, &bdev_flush); > + > + /* > + * Pull all the log vectors off the items in the CIL, and remove the > + * items from the CIL. We don't need the CIL lock here because it's only > + * needed on the transaction commit side which is currently locked out > + * by the flush lock. > */ > lv = NULL; > num_iovecs = 0; > @@ -806,6 +821,12 @@ xlog_cil_push_work( > lvhdr.lv_iovecp = &lhdr; > lvhdr.lv_next = ctx->lv_chain; > > + /* > + * Before we format and submit the first iclog, we have to ensure that > + * the metadata writeback ordering cache flush is complete. > + */ > + wait_for_completion(&bdev_flush); > + > error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true); > if (error) > goto out_abort_free_ticket; -- chandan