On Thu, Jan 28, 2021 at 10:12:05AM -0500, Brian Foster wrote:
> On Thu, Jan 28, 2021 at 03:41:52PM +1100, Dave Chinner wrote:
> ...
> >
> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > ---
> >  fs/xfs/xfs_log.c      | 34 ++++++++++++++++++++++------------
> >  fs/xfs/xfs_log_priv.h |  3 +++
> >  2 files changed, 25 insertions(+), 12 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index c5e3da23961c..8de93893e0e6 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> ...
> > @@ -2464,9 +2465,18 @@ xlog_write(
> >  		ASSERT(log_offset <= iclog->ic_size - 1);
> >  		ptr = iclog->ic_datap + log_offset;
> >
> > -		/* start_lsn is the first lsn written to. That's all we need. */
> > -		if (!*start_lsn)
> > +		/*
> > +		 * Start_lsn is the first lsn written to. That's all the caller
> > +		 * needs to have returned. Setting it indicates the first iclog
> > +		 * of a new checkpoint or the commit record for a checkpoint, so
> > +		 * also mark the iclog as requiring a pre-flush to ensure all
> > +		 * metadata writeback or journal IO in the checkpoint is
> > +		 * correctly ordered against this new log write.
> > +		 */
> > +		if (!*start_lsn) {
> >  			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
> > +			iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
> > +		}
>
> My understanding is that one of the reasons for the preflush per iclog
> approach is that we don't have any submission -> completion ordering
> guarantees across iclogs. This is why we explicitly order commit record
> completions and whatnot, to ensure the important bits are ordered
> correctly. The fact we implement that ordering ourselves suggests that
> PREFLUSH|FUA itself do not provide such ordering, though that's not
> something I've investigated.

PREFLUSH provides ordering between completed IOs and the IO to be
submitted.
It does not provide any ordering guarantees against IO currently in
flight, so the application needs to wait for the IOs it needs to
order against to complete before issuing an IO with PREFLUSH.

i.e. PREFLUSH provides a "many" completion to "single" submission
ordering guarantee on stable storage.

REQ_FUA only guarantees that when the write IO completes, it is on
stable storage. It does not provide ordering guarantees against any
IO in flight, nor IOs submitted while it is in flight. Once it
completes, however, it is guaranteed that any later IO submission
will hit stable storage after that IO.

i.e. REQ_FUA provides a "single" completion to "many" submission
ordering guarantee on stable storage.

> In any event, if the purpose of the PREFLUSH is to ensure that metadata
> in the targeted LSN range is committed to stable storage, and we have no
> submission ordering guarantees across non-commit record iclogs, what
> prevents a subsequent iclog from the same checkpoint from completing
> before the first iclog with a PREFLUSH?

Fair point. I suspect that we should just do an explicit cache flush
before we start the checkpoint, and then we don't have to worry
about REQ_PREFLUSH for the first iclog in the checkpoint at all.

Actually, I wonder if we can pipeline that - submit an async cache
flush bio as soon as we enter the push work, then once we're ready
to call xlog_write() having pulled the hundreds of thousands of log
vectors off the CIL, we wait on the cache flush bio to complete.
This gets around the first iclog in a long checkpoint requiring
cache flushing or FUA. It also means that if there is a single iclog
for the checkpoint, we only need a FUA write as the cache flush has
already been done...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
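
For anyone following along, the PREFLUSH and FUA semantics described above can be sketched as a toy userspace model. This is not kernel code and not how the block layer is implemented: `struct toy_dev`, `toy_submit()`, `toy_complete()`, `TOY_PREFLUSH` and `TOY_FUA` are all hypothetical names made up for illustration, and the model simplifies by applying the pre-flush at submission time. It only tries to show that a PREFLUSH covers writes that have already completed, not writes still in flight, and that a FUA write is stable once it completes.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_IO 16

/* Where a write currently lives in the toy device. */
enum io_state { IO_NONE, IO_IN_FLIGHT, IO_IN_CACHE, IO_STABLE };

/* Hypothetical stand-ins for REQ_PREFLUSH and REQ_FUA. */
#define TOY_PREFLUSH (1u << 0)
#define TOY_FUA      (1u << 1)

struct toy_dev {
	enum io_state state[MAX_IO];
	unsigned int flags[MAX_IO];
};

/*
 * Submit write 'id'. With TOY_PREFLUSH, every write that has already
 * completed into the volatile cache is moved to stable storage first.
 * Writes still in flight are not touched: the flush gives no ordering
 * against them.
 */
static void toy_submit(struct toy_dev *d, int id, unsigned int flags)
{
	if (flags & TOY_PREFLUSH) {
		for (int i = 0; i < MAX_IO; i++)
			if (d->state[i] == IO_IN_CACHE)
				d->state[i] = IO_STABLE;
	}
	d->state[id] = IO_IN_FLIGHT;
	d->flags[id] = flags;
}

/*
 * Complete write 'id'. A TOY_FUA write is on stable storage when it
 * completes; any other write only reaches the volatile cache.
 */
static void toy_complete(struct toy_dev *d, int id)
{
	assert(d->state[id] == IO_IN_FLIGHT);
	d->state[id] = (d->flags[id] & TOY_FUA) ? IO_STABLE : IO_IN_CACHE;
}

#ifdef TOY_DEMO_MAIN
int main(void)
{
	struct toy_dev d = {0};

	/* Writes 0 and 1 model metadata writeback; only 0 has completed. */
	toy_submit(&d, 0, 0);
	toy_submit(&d, 1, 0);
	toy_complete(&d, 0);

	/* Write 2 models an iclog write issued with both flags set. */
	toy_submit(&d, 2, TOY_PREFLUSH | TOY_FUA);
	toy_complete(&d, 1);
	toy_complete(&d, 2);

	printf("write 0 stable: %d\n", d.state[0] == IO_STABLE);
	printf("write 1 stable: %d\n", d.state[1] == IO_STABLE);
	printf("write 2 stable: %d\n", d.state[2] == IO_STABLE);
	return 0;
}
#endif
```

Building with `-DTOY_DEMO_MAIN` runs the small demo. In the model, write 1 was still in flight when the PREFLUSH write was submitted, so it ends up only in the volatile cache. That is the case where the submitter has to wait for completion before issuing the PREFLUSH.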