On Tue, Dec 28, 2021 at 09:36:22PM +0100, Manfred Spraul wrote:
> Hi,
>
> with simulated power failures, I see a corrupted journal
>
> [39056.200845] JBD2: journal transaction 6943 on loop0-8 is corrupt.
> [39056.200851] EXT4-fs (loop0): error loading journal

This means that the journal replay found a commit which was *not* the last commit, and which contained a CRC error. If it is the last commit (i.e., there is no valid subsequent commit block), then it's possible that the journal commit was never completed before the system crashed; in other words, it was an interrupted commit.

Your test is aborting the commit at various points in the write I/O stream, so it should be simulating an interrupted commit (assuming that it's not corrupting any I/O). So the jbd2 layer should have understood it was the last commit in the journal, and been OK with the checksum failure.

But what can happen is that if there is a commit block in the right place at the end of the transaction, left over from the previous journalling session, this can confuse the jbd2 layer into thinking that it is *not* the last transaction, and then it will make the "journal transaction is corrupt" report.

How does the jbd2 layer determine whether there is a valid "subsequent commit"? It treats the subsequent commit block as valid if it meets the following two criteria:

* the commit id is the correct, expected one (the previous commit id plus one).
* the commit time (seconds since January 1, 1970) in the commit block is greater than the commit time in the previous commit block.

So if your test setup doesn't correctly set the time (say, it always leaves the bootup time at January 1, 1970), and the workload is extremely regular, it's possible that the replay interrupted a journal commit, but there was a left-over commit block that *looked* valid, and it triggered the failure.
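To make the two criteria above concrete, here is a rough sketch in Python. It is not the kernel code: `CommitBlock` and `looks_like_valid_next_commit` are made-up names, the block is reduced to just the two fields discussed, and details such as sequence number wraparound are ignored. It only mirrors the check as worded in this message.

```python
from dataclasses import dataclass


@dataclass
class CommitBlock:
    """Hypothetical, simplified commit block: only the two fields discussed."""
    commit_id: int   # transaction sequence number
    commit_sec: int  # commit time, seconds since January 1, 1970


def looks_like_valid_next_commit(prev: CommitBlock, candidate: CommitBlock) -> bool:
    """Apply the two criteria as described in this message (an assumption,
    not a copy of the jbd2 implementation)."""
    expected_id = candidate.commit_id == prev.commit_id + 1
    later_time = candidate.commit_sec > prev.commit_sec
    return expected_id and later_time


# Clock set correctly: a stale block from an older session carries an
# older timestamp, so it is rejected even if its id happens to match.
prev = CommitBlock(commit_id=6942, commit_sec=1640723782)
stale = CommitBlock(commit_id=6943, commit_sec=1640000000)
print(looks_like_valid_next_commit(prev, stale))  # False

# Clock restarts near zero on every boot: with a very regular workload,
# a left-over block can have the expected id *and* a slightly later time.
prev_no_clock = CommitBlock(commit_id=6942, commit_sec=5)
stale_no_clock = CommitBlock(commit_id=6943, commit_sec=9)
print(looks_like_valid_next_commit(prev_no_clock, stale_no_clock))  # True
```

The second case is the one that would make an interrupted commit look like a corrupt commit in the middle of the journal.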
If this is what happened, it's not a disaster. The journal replay will have correctly stopped where it should have, but it thought it was an exceptional abort, as opposed to a normal journal replay completion. So the "file system is corrupted" flag will be set, forcing an fsck, but the fsck shouldn't find any problems with the file system.

Does this explanation seem to fit with how your test setup is arranged?

- Ted