On Thu, Jan 02, 2014 at 19:42, Theodore Ts'o [mailto:tytso@xxxxxxx] wrote:
> On Thu, Jan 02, 2014 at 12:59:52PM +0800, Huang Weller (CM/ESW12-CN)
> wrote:
> >
> > We did more test which we backup the journal blocks before we mount
> > the test partition.
> > Actually, before we mount the test partition, we use fsck.ext4 with
> > -n option to verify whether there is any bad extents issues
> > available. The fsck.ext4 never found any such kind issue. And we can
> > prove that the bad extents issue is happened after journaling replay.
>
> Ok, so that implies that the failure is almost certainly due to
> corrupted blocks in the journal. Hence, when we replay the journal, it
> causes the file system to become corrupted, because the "newer"
> (and presumably, "more correct") metadata blocks found in the blocks
> recorded in the journal are in fact corrupted.
>
> .....
> >
> > We searched such error on internet, there are some one also has such
> > issue. But there is no solution.
> > This issue maybe not a big issue which it can be repaired by
> > fsck.ext4 easily. But we have below questions:
> > 1. whether this issue already been fixed in the latest kernel version?
> > 2. based on the information I provided in this mail, can you help to
> > solve this issue ?
>
> Well, the question is how did the journal get corrupted? It's possible
> that it's caused by a kernel bug, although I'm not aware of any such
> bugs being reported.
>
> In my mind, the most likely cause is that the SD card is ignoring the
> CACHE FLUSH command, or is not properly saving the SD card's Flash
> Translation Layer (FTL) metadata on a power drop.

Yes, this is a possible reason, but we ran exactly the same test not
only with power drops but also with iMX watchdog resets only. In the
latter case there was no power drop for the eMMC, yet we observed
exactly the same kind of inode corruption.

During thousands of test loops with power drops or watchdog resets,
while creating thousands of files from multiple threads, we did not
observe any other kind of ext4 metadata damage or file content damage.
In every error case so far we found only a single damaged inode. The
other inodes before and after the damaged inode in the journal, in the
same logical 4096-byte block, seem to be intact and valid (examined
with a hex editor).

In all the failure cases, as far as we can tell from the ext4 disk
layout documentation, only ee_len or the ee_start_hi and ee_start_lo
entries are wrong (i.e. zeroed). The eMMC has no "knowledge" of the
logical meaning or the offset of ee_len or ee_start. Thus, it does not
seem very likely that some internal failure or bug in the eMMC
controller/firmware would always damage these few bytes and nothing
else.

> What I tell people who are using flash devices is before they start
> using any flash device, to do power drop testing on a raw device,
> without any file system present. The simplest way to do this is to
> write a program that writes consecutive 4k blocks that contain a
> timestamp, a sequence number, some random data, and a CRC-32 checksum
> over the contents of the timestamp, sequence number, a flags word, and
> random data. As the program writes such 4k block, it rolls the dice
> and once every 64 blocks or so (i.e., pick a random number, and see if
> it is divisible by 64), then set a bit in the flags word indicating
> that this block was forced out using a cache flush, and then when
> writing this block, follow up the write with a CACHE FLUSH command.
> It's also best if the test program prints the blocks which have been
> written with CACHE FLUSH to the serial console, and that this is saved
> by your test rig.

We did similar tests in the past, but not yet with this particular type
of eMMC. I think we should repeat them with this type.
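
For illustration, the block format and write loop described in the
quoted text could be sketched roughly as follows in Python. This is only
a sketch under assumptions: the field widths and order, the FLAG_FLUSHED
bit and all function names are hypothetical, and it assumes that fsync()
on the raw device node causes the kernel to send a cache flush to the
device. It also does not retry short writes.

```python
import os
import struct
import time
import zlib

BLOCK_SIZE = 4096
# Hypothetical layout: timestamp (u64, ns), sequence number (u64),
# flags word (u32), random payload, then CRC-32 (u32) over all of it.
HEADER = struct.Struct("<QQI")
CRC = struct.Struct("<I")
PAYLOAD_SIZE = BLOCK_SIZE - HEADER.size - CRC.size
FLAG_FLUSHED = 0x1  # hypothetical bit: this block was followed by a flush


def build_block(seq, flushed, timestamp_ns=None, payload=None):
    """Return one 4096-byte test block ending in a CRC-32."""
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()
    if payload is None:
        payload = os.urandom(PAYLOAD_SIZE)
    if len(payload) != PAYLOAD_SIZE:
        raise ValueError("payload must be exactly PAYLOAD_SIZE bytes")
    flags = FLAG_FLUSHED if flushed else 0
    body = HEADER.pack(timestamp_ns, seq, flags) + payload
    return body + CRC.pack(zlib.crc32(body))


def parse_block(block):
    """Return (timestamp_ns, seq, flags) if the CRC matches, else None."""
    if len(block) != BLOCK_SIZE:
        return None
    body = block[:-CRC.size]
    (stored,) = CRC.unpack(block[-CRC.size:])
    if zlib.crc32(body) != stored:
        return None
    return HEADER.unpack_from(body)


def write_test_blocks(path, count):
    """Write `count` consecutive blocks to `path`, flushing about 1 in 64."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for seq in range(count):
            # "Roll the dice": flush if a random number is divisible by 64.
            flushed = int.from_bytes(os.urandom(4), "little") % 64 == 0
            os.write(fd, build_block(seq, flushed))
            if flushed:
                # Assumption: fsync() here results in a device cache flush.
                os.fsync(fd)
                # Log flushed blocks so the test rig can record them.
                print("flushed block", seq, flush=True)
    finally:
        os.close(fd)
```

After a power drop or reset, a checker would read the device back with
parse_block() and confirm that every block logged as flushed, and every
block written before it, is present with a valid CRC.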

> (This is what ext4's journal does before and after writing the commit
> block in the journal, and it guarantees that (a) all of the data in the
> journal written up to the commit block will be available after a power
> drop, and (b) that the commit block has been written to the storage
> device and again, will be available after a power drop.)

Well, we also ran the same tests with journal_checksum enabled. We were
still able to reproduce the failure without any checksum error. So we
believe that the respective transaction (as well as all others) was
complete and not corrupted by the eMMC. Is this a valid assumption? If
so, I would assume that the corrupted inode was really written to the
eMMC in that state and was not corrupted by the eMMC.

(BTW, we do know that journal_checksum is somewhat critical and might
make things worse, but for test purposes, and to rule out that the eMMC
delivers corrupted transactions when the data is read back, it seemed
to be a meaningful approach.)

So, I think there _might_ be a kernel bug, but it could also be a
problem related to this particular type of eMMC. We did not observe the
same issue in previous tests with another type of eMMC from another
supplier, but that was with an older kernel patch level and a different
HW design.

Regarding a possible kernel bug: is there any chance that the invalid
ee_len or ee_start values are returned by, e.g., the block allocator?
If so, can we instrument the code to get suitable traces, just to
confirm or rule out that the corrupted inode is really written to the
eMMC?

Best regards

Dirk Juergens

Robert Bosch Car Multimedia GmbH

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html