Hi folks, I've been thinking a little about the way we write use cache flushes recently and I was thinking about how we do journal writes and whether we need to issue as many cache flushes as we currently do. RIght now, every journal write is REQ_PREFLUSH | REQ_FUA, except in two cases: 1. the log and data devices are different, and we've already issued a cache flush to the data device, and 2. it's the second write of a split iclog (i.e. tail wrapping) and so the second bio doesn't need another pre-flush to be run before submission. For the purposes of my thinking, #1 above is irrelevant - if we are able to elide pre-flush from most iclog IOs, we can elide most of the data dev cache flushes when the log is on a separate device. However, it's #2 that has got me thinking: the pre-flush isn't necessary if the journal IO has already received a cache flush that covers the context of the larger journal write. This behaviour has been with us since, well, since explicit log IO cache flushes were added almost 2 decades ago by Steve Lord: commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96 Author: Steve Lord <lord@xxxxxxx> Date: Fri May 24 14:30:21 2002 +0000 Add support for drive write cache flushing - should the kernel have the infrastructure But things have changed greatly since then, namely delayed logging and CIL checkpoints. They have greatly changed how we use iclogs for writing changes to the journal, but the journal IO mechanisms are largely unchanged. That is, the pre-flush is ensuring that all the metadata in the LSN range of the journal that the iclog sync is about overwrite is already on stable storage. This means that the metadata writeback in that range of LSNs has been completed and so we can write the current log tail into this iclog and correctly order the journal IO against the already-signalled metadata completion by using a pre-IO cache flush on the journal IO. >From this observation, it seems to me that we only actually need a pre-flush to guarantee the metadata being overwritten is stable if the log tail has moved between the last iclog write and this one. i.e. if the tail hasn't moved, then a previous iclog write issued the cache flush to make the metadata in the current LSN overwrite range stable and so this iclog doesn't need to do it again. This implies that that we only need a device cache pre-flush if (iclog->ic_header.h_lsn != iclog->ic_prev->ic_header.h_lsn). This alone would greatly reduce the number of pre-flushes we issue during a long running CIL checkpoint as tail updates don't occur frequently. But the CIL and checkpoint transactions allows us to take this further. That is, when we start a checkpoint transaction we've actually reserved all the space in the log that we require to write the entire checkpoint. The tail *must* already be beyond where the end of the checkpoint will reach into the journal because the space reservations guarantee that the reserve head never wraps around past the current journal tail. Hence, when we start a new checkpoint, it will already fit into the dead LSN range between the head and the tail, and so we only need a single cache pre-flush to guarantee completion to submission ordering of the metadata and to-be-issued checkpoint IO into that LSN range. That means we only need one data device cache flush per CIL checkpoint, not one per iclog written in the CIL checkpoint. IOWs, if we running a 32MB CIL (2GB log) and running full checkpoints every push, we could reduce the number of cache flushes by a factor of 100 (256kb iclog = 128 journal IOs for a full CIL push) to 1000 (32kB journal = 1024 journal IOs). That's a significant, ongoing reduction in cache flushes. And then I woundered if we could apply the same logic to post-journal write cache flushes (REQ_FUA) that guarantee that the journal writes are stable before we allow writeback of the metadata in that LSN range (i.e. once they are unpinned). Again, we have a completion to submission ordering requirement here, only this time it is journal IO completion to metadata IO submission. IOWs, I think the same observation about the log head and the AIL writeback mechanism can be made here: we only need to ensure a cache flush occurs before we start writing back metadata at an LSN higher than the journal head at the time of the last cache flush. The first iclog write of last CIL checkpoint will have ensured all metadata lower than the LSN of the CIL checkpoint is stable, hence we only need to concern ourselves about metadata at the same LSN as that checkpoint. checkpoint completion will unpin that metadata, but we still need a cache flush to guarantee ordering at the stable storage level. Hence we can use an on-demand AIL traversal cache flush to ensure we have journal-to-metadata ordering. This will be much rarer than every using FUA for every iclog write, and should be of similar order of gains to the REQ_PREFLUSH optimisation. FWIW, because we use checksums to detect complete checkpoints in the journal now, we don't actually need to use FUA writes to guarantee they hit stable storage. We don't have a guarantee in what order they will hit the disk (even with FUA), so the only thing that the FUA write gains us is that on some hardware it elides the need for a post-write cache flush. Hence I don't think we need REQ_FUA, either. I suspect that the first iclog in a CIL checkpoint should also use REQ_FUA, if only to ensure that the log is dirtied as quickly as possible when a new checkpoint starts. The rest of the checkpoint ordering or cache flushing just doesn't matter until other external events require explicit ordering. The only explicit ordering we really have are log forces. As long as log forces issue a cache flush when they are left pending by CIL transaction completion, we shouldn't require anything more here. The situation is similar to the AIL requirement... In summary, I think we can get rid of almost all iclog REQ_PREFLUSH and REQ_FUA usage simply by: - first iclog write in a CIL checkpoint uses REQ_PREFLUSH - AIL traversal to an item with the current head LSN triggers a cache flush - SYNC log force triggers a cache flush if required to stabilise the last checkpoint written to the log, otherwise caller flushes caches as necessary (same as current fsync semantics) Given that metadata intensive workloads can result in hundreds to thousands of journal IOs every second, getting rid of the vast majority of these cache flush operations would be a major performance win for such workloads... Of course, this seems so simple now that I've written it, making me wonder what I've missed. Thoughts? Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx