On Tue, Jan 09, 2024 at 09:52:37AM -0500, Brian Foster wrote:
> > > > > After commit 12818d24db8a ("xfs: rework log recovery to submit buffers on
> > > > > LSN boundaries") was introduced, we submit buffers on LSN boundaries during
> > > > > log recovery.
> > > >
> > > > Correct - we submit all the changes in a checkpoint before we start
> > > > recovering the next checkpoint. That's because checkpoints are
> > > > supposed to be atomic units of change moving the on-disk state from
> > > > one change set to the next.
> > >
> > > Submitting buffers on LSN boundaries does not mean submitting buffers
> > > on checkpoint boundaries during recovery. In my understanding, one
> > > transaction on disk corresponds to a checkpoint, but multiple
> > > transactions on disk may share the same LSN, so in such cases we have
> > > to submit the buffers for multiple transactions at the same time.
> > > This rule was introduced by commit 12818d24db8a ("xfs: rework log
> > > recovery to submit buffers on LSN boundaries").
> >
> > Well, yes, that's exactly the situation that commit 12818d24db8a was
> > intended to handle:
> >
> > "If independent transactions share an LSN and both modify the
> > same buffer, log recovery can incorrectly skip updates and leave
> > the filesystem in an inconsistent state."
> >
> > Unfortunately, we didn't take into account the complexity of
> > multiple transactions sharing the same start LSN in commit
> > 12818d24db8a ("xfs: rework log recovery to submit buffers on LSN
> > boundaries") back in 2016.
> >
> > Indeed, we didn't even know that there was a reliance on strict
> > start record LSN ordering in journal recovery until 2021:
> >
> > commit 68a74dcae6737c27b524b680e070fe41f0cad43a
> > Author: Dave Chinner <dchinner@xxxxxxxxxx>
> > Date:   Tue Aug 10 18:00:44 2021 -0700
> >
> >     xfs: order CIL checkpoint start records
> >
> >     Because log recovery depends on strictly ordered start records as
> >     well as strictly ordered commit records.
> >
> >     This is a zero day bug in the way XFS writes pipelined transactions
> >     to the journal which is exposed by fixing the zero day bug that
> >     prevents the CIL from pipelining checkpoints. This re-introduces
> >     explicit concurrent commits back into the on-disk journal and hence
> >     out of order start records.
> >
> >     The XFS journal commit code has never ordered start records and we
> >     have relied on strict commit record ordering for correct recovery
> >     ordering of concurrently written transactions. Unfortunately, root
> >     cause analysis uncovered the fact that log recovery uses the LSN of
> >     the start record for transaction commit processing. Hence, whilst
> >     the commits are processed in strict order by recovery, the LSNs
> >     associated with the commits can be out of order and so recovery may
> >     stamp incorrect LSNs into objects and/or misorder intents in the AIL
> >     for later processing. This can result in log recovery failures
> >     and/or on disk corruption, sometimes silent.
> >
> >     Because this is a long standing log recovery issue, we can't just
> >     fix log recovery and call it good. This still leaves older kernels
> >     susceptible to recovery failures and corruption when replaying a log
> >     from a kernel that pipelines checkpoints. There is also the issue
> >     that in-memory ordering for AIL pushing and data integrity
> >     operations are based on checkpoint start LSNs, and if the start LSN
> >     is incorrect in the journal, it is also incorrect in memory.
> >
> >     Hence there's really only one choice for fixing this zero-day bug:
> >     we need to strictly order checkpoint start records in ascending
> >     sequence order in the log, the same way we already strictly order
> >     commit records.
> >
> >     Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> >     Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> >     Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> >
> > Essentially, the problem now is that even with strictly ordered
> > start records for checkpoints, checkpoints with the same start LSN
> > interfere with each other in recovery because recovery is not
> > aware of the fact that we can have multiple checkpoints that start
> > with the same LSN.
> >
> > This is another zero-day issue with the journal and log recovery;
> > originally there was no "anti-recovery" logic in the journal like we
> > have now with LSNs to prevent recovery from taking metadata state
> > backwards. Hence log recovery just always replayed every change
> > that was in the journal from start to finish and so there was never
> > a problem with having multiple start records in the same log record.
> >
> > However, this was known to cause problems with inodes and data vs
> > metadata sequencing and non-transactional inode metadata updates
> > (e.g. inode size), so a "flush iteration" counter was added to
> > inodes in 2003:
> >
> > commit 6ed3d868e47470a301b49f1e8626972791206f50
> > Author: Steve Lord <lord@xxxxxxx>
> > Date:   Wed Aug 6 21:17:05 2003 +0000
> >
> >     Add versioning to the on disk inode which we increment on each
> >     flush call. This is used during recovery to avoid replaying an
> >     older copy of the inode from the log. We can do this without
> >     versioning the filesystem as the pad space we borrowed was
> >     always zero and will be ignored by old kernels.
> >     During recovery, do not replay an inode log record which is older
> >     than the on disk copy. Check for wrapping in the counter.
> >
> > This was never fully reliable, and there were always issues with
> > this counter because inode changes weren't always journalled nor
> > were cache flushes used to ensure unlogged inode metadata updates
> > reached stable storage.
> >
> > The LSN sequencing was added to the v5 format to ensure metadata
> > never goes backwards in time on disk without fail. The issue you've
> > uncovered shows that we still have issues stemming from the
> > original journal recovery algorithm that was not designed with
> > anti-recovery protections in mind from the start.
> >
> > The problem we need to solve is this: how do we preserve the
> > necessary anti-recovery behaviour when we have multiple checkpoints
> > that can have the same LSN and objects are updated immediately on
> > recovery?
> >
> > I suspect that we need to track that the checkpoint being recovered
> > has a duplicate start LSN (i.e. in the struct xlog_recover) and
> > modify the anti-recovery LSN check to take this into account. i.e.
> > we can really only skip recovery of the first checkpoint at any
> > given LSN because we cannot disambiguate an LSN updated by the first
> > checkpoint at that LSN and the metadata already being up to date on
> > disk in the second and subsequent checkpoints at the same start
> > LSN.
> >
> > There are likely to be other solutions - anyone have a different
> > idea on how we might address this?
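
If I understand Dave's suggestion above correctly, the modified
anti-recovery check would end up looking roughly like the sketch
below. The struct and helper names are made up for illustration
(local stand-ins for struct xlog_recover and XFS_LSN_CMP()); this is
not the actual code, just my reading of the idea:

/*
 * Illustrative sketch only: the types and names below are local
 * stand-ins, not the real definitions from fs/xfs/xfs_log_recover.c.
 */
#include <stdbool.h>
#include <stdint.h>

typedef int64_t xfs_lsn_t;		/* stand-in for the kernel type */

/* stand-in for the kernel's XFS_LSN_CMP(): returns <0, 0 or >0 */
static int lsn_cmp(xfs_lsn_t a, xfs_lsn_t b)
{
	return (a > b) - (a < b);
}

/* hypothetical per-checkpoint recovery state (not struct xlog_recover) */
struct recover_ckpt {
	xfs_lsn_t	start_lsn;	/* start record LSN of this checkpoint */
	bool		dup_start_lsn;	/* an earlier checkpoint in this
					 * recovery pass had the same start LSN */
};

/*
 * Can the anti-recovery check skip replaying an object whose on-disk
 * LSN is disk_lsn while recovering this checkpoint?
 */
static bool can_skip_replay(const struct recover_ckpt *ckpt, xfs_lsn_t disk_lsn)
{
	if (ckpt->dup_start_lsn) {
		/*
		 * Second or later checkpoint at this start LSN: an on-disk
		 * LSN equal to ours may have been stamped by the *first*
		 * checkpoint at this LSN, so it does not prove our changes
		 * are already on disk. Only a strictly newer on-disk LSN
		 * allows the skip.
		 */
		return lsn_cmp(disk_lsn, ckpt->start_lsn) > 0;
	}

	/* First checkpoint at this LSN: disk already at or past our LSN. */
	return lsn_cmp(disk_lsn, ckpt->start_lsn) >= 0;
}

i.e. an equal on-disk LSN would no longer be sufficient to skip replay
for the second and subsequent checkpoints at a given start LSN.
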
> >
>
> It's been a while since I've looked at any of this and I haven't waded
> through all of the details, so I could easily be missing something, but
> what exactly is wrong with the approach of the patch as posted?
>
> Commit 12818d24db ("xfs: rework log recovery to submit buffers on LSN
> boundaries") basically created a new invariant for log recovery where
> buffers are allowed to be written only once per LSN. The risk otherwise
> is that a subsequent update with a matching LSN would not be correctly
> applied due to the v5 LSN ordering rules. Since log recovery processes
> transactions (using terminology/granularity as defined by the
> implementation of xlog_recover_commit_trans()), this required changes to
> accommodate any of the various possible runtime logging scenarios that
> could cause a buffer to have multiple entries in the log associated with
> a single LSN, the details of which were orthogonal to the fix.
>
> The functional change therefore was that rather than process and submit
> "transactions" in sequence during recovery, the pending buffer list was
> lifted to a higher level in the code, a tracking field was added for the
> "current LSN" of log recovery, and only once we cross a current LSN
> boundary are we allowed to submit the set of buffers processed for the
> prior LSN. The reason for this logic is that seeing the next LSN is
> really the only way to know we're done processing items for a
> particular LSN.
>
> If I understand the problem description correctly, the issue here is
> that if an error is encountered in the middle of processing items for
> some LSN A, we bail out of recovery and submit the pending buffers on
> the way out. If we haven't completed processing all items for LSN A
> before failing, however, then we've just possibly violated the "write
> once per LSN" invariant that protects the fs from corruption. This is
> because the writeback permanently updates metadata LSNs (assuming that
> I/O doesn't fail), which means that if recovery retries from the same
> point the next time around and progresses to find a second instance of
> an already written buffer in LSN A, it will exhibit the same general
> behavior from before the write-once invariant was introduced. IOW,
> there's still a vector to the original problematic multi-write per LSN
> behavior through multiple recovery attempts (hence the simulated I/O
> error to reproduce).
>
> Long Li, am I following the problem description correctly? I've not
> fully reviewed it, but if so, the proposed solution seems fairly sane
> and logical to me. (And nice work tracking this down, BTW, regardless of
> whether this is the final solution. ;)
>

Hi Brian, your description is correct, and it is clear and easy to
understand. To double-check that we mean the same thing, I've put a
rough pseudocode sketch of the flow you describe below my signature.
Thanks for the encouragement.

Thanks,
Long Li
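
Sketch for illustration only: the names below are invented and the
structure is heavily simplified compared to the real code in
fs/xfs/xfs_log_recover.c. It is only meant to show where the error
path can submit a partially processed LSN.

#include <stdint.h>

typedef int64_t lsn_t;

struct ckpt;				/* one recovered checkpoint/"transaction" */
struct buf_list;			/* buffers queued for delayed write */

/* hypothetical helpers standing in for the real recovery machinery */
extern struct ckpt *next_checkpoint(void);
extern lsn_t ckpt_start_lsn(struct ckpt *c);
extern int replay_items(struct ckpt *c, struct buf_list *bl);
extern int submit_buffers(struct buf_list *bl);

/*
 * Replay checkpoints in order, submitting the queued buffers only when
 * the current LSN changes, so that each buffer is written at most once
 * per LSN.
 */
static int recover_log(struct buf_list *bl)
{
	lsn_t current_lsn = -1;		/* no LSN seen yet */
	struct ckpt *c;
	int error = 0;

	while ((c = next_checkpoint()) != NULL) {
		lsn_t lsn = ckpt_start_lsn(c);

		if (lsn != current_lsn) {
			/* crossed an LSN boundary: the prior LSN is complete */
			error = submit_buffers(bl);
			if (error)
				break;
			current_lsn = lsn;
		}

		error = replay_items(c, bl);
		if (error)
			break;		/* may leave current_lsn half-processed */
	}

	/*
	 * The error path still writes out whatever was queued. If the
	 * failure hit part-way through an LSN, these writes stamp the
	 * on-disk LSNs and a later recovery attempt can then wrongly
	 * skip the remaining items for that same LSN.
	 */
	submit_buffers(bl);
	return error;
}
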