On Tue, Jul 01, 2014 at 06:44:45AM -0000, Dolev Raviv wrote:
> Crash description:
> I saw a BUG_ON assertion failure in function ext4_clear_journal_err(). The
> assertion that fails is:
>     !EXT4_HAS_COMPAT_FEATURE(sb, EXT4_FEATURE_COMPAT_HAS_JOURNAL)
> The strange thing is that the same BUG_ON assertion is checked at the
> start of the function that calls ext4_clear_journal_err(), which is
> ext4_load_journal(). This means that the compat feature flag is changed
> inside ext4_load_journal(), before the call to ext4_clear_journal_err().
>
> I'm not too familiar with the ext4 code, unfortunately. From analyzing the
> journal path I came to the conclusions below:
> This scenario is possible if, during journal replay, the super_block is
> restored or overwritten from the journal.
> I have noticed a case where the sb is marked dirty and is later written
> back through the address_space_operations .writepage = ext4_writepage
> callback. This callback uses the journal and can cause the dirty sb to
> appear in the journal. If a power cut occurs during the journal write and
> the sb copy in the journal is corrupted, it may cause the BUG_ON assertion
> failure above.

Yes, this is possible --- but if the journal has been corrupted, something
pretty disastrous has happened. Indeed, if that has happened, it may be
that some other portions of the file system have also been wiped out. So
I'd ask whether you have a bigger issue, such as crappy flash that either
does not properly implement the CACHE FLUSH operation, or does not have
proper transaction handling for its FTL metadata. In that case, even if
the data blocks were correctly saved, removing power while the SSD or eMMC
flash is doing a GC operation can corrupt data or metadata blocks
(potentially including blocks written days or months ago). Unfortunately,
there is a huge amount of crappy flash out there, and there's not much the
file system can do about it.

> Is the scenario described above even possible (or am I missing something)?
> Has anyone encountered similar issues? Are there any known fixes for this?

We do have journal checksums, but the reason they haven't been enabled by
default is that e2fsck doesn't have good recovery from a corrupted journal.
It will detect a bad journal block, but we don't have good recovery
strategies implemented yet.

We could add a sanity check so that, in the absence of journal checksums,
if we are replaying the superblock and the journal's copy of the superblock
looks insane, we abort the journal replay. It's not going to help you
recover the bad file system, but it will prevent the BUG_ON.

Personally, I'd focus on why the journal got corrupted in the first place.
A BUG_ON is transient; you reboot and move on. Data corruption (at least
in the absence of backups, and you *have* been doing backups, right?) is
forever....

						- Ted
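
A minimal sketch of what such a sanity check could look like, assuming it
is run against the journal's copy of the superblock before that copy is
allowed to overwrite the on-disk one. The helper name and the exact call
site in the replay path are assumptions for illustration, not existing
kernel code; only the struct fields and feature macros from fs/ext4/ext4.h
are real:

/*
 * Hypothetical helper, for illustration only: decide whether the
 * journal's copy of the superblock still looks like a valid ext4
 * superblock.  Returns 1 if it looks sane, 0 if the replay should
 * be aborted instead of letting this block overwrite the real sb.
 */
static int ext4_journal_sb_looks_sane(struct ext4_super_block *es)
{
	/* The magic number has to survive replay intact. */
	if (le16_to_cpu(es->s_magic) != EXT4_SUPER_MAGIC)
		return 0;

	/*
	 * A file system whose journal we are replaying must still claim
	 * to have a journal; losing HAS_JOURNAL here is what later trips
	 * the BUG_ON in ext4_clear_journal_err().
	 */
	if (!(le32_to_cpu(es->s_feature_compat) &
	      EXT4_FEATURE_COMPAT_HAS_JOURNAL))
		return 0;

	return 1;
}

If the check fails, the replay would be aborted and the file system left
for e2fsck, which matches the "prevent the BUG_ON, not recover the file
system" goal above. Since the jbd2 replay code is file-system agnostic,
wiring this in would presumably need ext4 to supply the check (e.g. via a
callback), but that plumbing is beyond this sketch.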