On Wed, Feb 24, 2021 at 05:57:20PM +0530, Chandan Babu R wrote:
> On 23 Feb 2021 at 13:35, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> >
> > Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
> > guarantee the ordering requirements the journal has w.r.t. metadata
> > writeback. The two ordering constraints are:
> >
> > 1. we cannot overwrite metadata in the journal until we guarantee
> > that the dirty metadata has been written back in place and is
> > stable.
> >
> > 2. we cannot write back dirty metadata until it has been written to
> > the journal and guaranteed to be stable (and hence recoverable) in
> > the journal.
> >
> > The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
> > causes the journal IO to issue a cache flush and wait for it to
> > complete before issuing the write IO to the journal. Hence all
> > completed metadata IO is guaranteed to be stable before the journal
> > overwrites the old metadata.
> >
> > The ordering guarantees of #2 are provided by the REQ_FUA, which
> > ensures the journal writes do not complete until they are on stable
> > storage. Hence by the time the last journal IO in a checkpoint
> > completes, we know that the entire checkpoint is on stable storage
> > and we can unpin the dirty metadata and allow it to be written back.
> >
> > This is the mechanism by which ordering was first implemented in XFS
> > way back in 2002 by this commit:
> >
> > commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
> > Author: Steve Lord <lord@xxxxxxx>
> > Date: Fri May 24 14:30:21 2002 +0000
> >
> >     Add support for drive write cache flushing - should the kernel
> >     have the infrastructure
> >
> > A lot has changed since then, most notably we now use delayed
> > logging to checkpoint the filesystem to the journal rather than
> > write each individual transaction to the journal. Cache flushes on
> > journal IO are necessary when individual transactions are wholly
> > contained within a single iclog. However, CIL checkpoints are single
> > transactions that typically span hundreds to thousands of individual
> > journal writes, and so the requirements for device cache flushing
> > have changed.
> >
> > That is, the ordering rules I state above apply to ordering of
> > atomic transactions recorded in the journal, not to the journal IO
> > itself. Hence we need to ensure metadata is stable before we start
> > writing a new transaction to the journal (guarantee #1), and we need
> > to ensure the entire transaction is stable in the journal before we
> > start metadata writeback (guarantee #2).
> >
> > Hence we only need a REQ_PREFLUSH on the journal IO that starts a
> > new journal transaction to provide #1, and it is not needed on any
> > other journal IO done within the context of that journal transaction.
> >
> > The CIL checkpoint already issues a cache flush before it starts
> > writing to the log, so we no longer need the iclog IO to issue a
> > REQ_PREFLUSH for us. Hence if XLOG_START_TRANS is passed
> > to xlog_write(), we no longer need to mark the first iclog in
> > the log write with REQ_PREFLUSH for this case.
> >
> > Given the new ordering semantics of commit records for the CIL, we
> > need iclogs containing commit records to issue a REQ_PREFLUSH. We also
>
> We flush the data device before writing the first iclog (containing
> XLOG_START_TRANS) to the disk. This satisfies the first ordering constraint
> listed above. Why is it required to have another REQ_PREFLUSH when writing the
> iclog containing XLOG_COMMIT_TRANS? I am guessing that it is required to
> make sure that the previous iclogs (belonging to the same checkpoint
> transaction) have indeed been written to the disk.

Yes, that is correct.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
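
For readers following the thread, the flag selection discussed above can be sketched roughly as below. This is a minimal, standalone illustration and not the XFS code: the SKETCH_* constants and the sketch_* functions are hypothetical names invented here, and the flag values are not the kernel's. The quoted patch description is cut off before it says which iclogs keep REQ_FUA, so putting FUA on the commit iclog is an assumption, based on guarantee #2 being attributed to REQ_FUA earlier in the description.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the block layer flags; not the kernel's values. */
enum {
	SKETCH_REQ_PREFLUSH = 1u << 0,	/* flush device cache before this write */
	SKETCH_REQ_FUA      = 1u << 1,	/* write is stable when it completes */
};

/* Old scheme: every journal IO is issued as REQ_PREFLUSH | REQ_FUA. */
unsigned int sketch_old_iclog_flags(void)
{
	return SKETCH_REQ_PREFLUSH | SKETCH_REQ_FUA;
}

/*
 * New scheme as described in the thread: only the iclog holding the
 * commit record needs a preflush, so that all earlier iclogs of the
 * same checkpoint are stable before the commit record is written.
 * FUA on the commit iclog is an assumption (see the note above).
 */
unsigned int sketch_new_iclog_flags(bool has_commit_record)
{
	if (has_commit_record)
		return SKETCH_REQ_PREFLUSH | SKETCH_REQ_FUA;
	return 0;
}

/*
 * Cache flushes needed to write one checkpoint of nr_iclogs journal IOs.
 * Old: one preflush per iclog. New: one explicit flush issued by the CIL
 * checkpoint before it starts writing, plus the commit iclog's preflush.
 */
unsigned int sketch_checkpoint_flushes(unsigned int nr_iclogs, bool new_scheme)
{
	if (nr_iclogs == 0)
		return 0;
	return new_scheme ? 2 : nr_iclogs;
}
```

Under these assumptions, a checkpoint spanning 1000 iclogs goes from 1000 cache flushes to 2, which is the reduction the patch description is arguing for.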