Re: ext4 filesystem bad extent error review

"Juergens Dirk (CM-AI/ECO2)" <Dirk.Juergens@xxxxxxxxxxxx> · Fri, 3 Jan 2014 19:45:40 +0100

On Fri, Jan 03, 2014 at 19:24, Eric Sandeen wrote:
> 
> On 1/3/14, 10:29 AM, Juergens Dirk (CM-AI/ECO2) wrote:
> > So, I think there _might_ be a kernel bug, but it could be also a problem
> > related to the particular type of eMMC. We did not observe the same issue
> > in previous tests with another type of eMMC from another supplier, but this
> > was with an older kernel patch level and with another HW design.
> >
> > Regarding a possible kernel bug: Is there any chance that the invalid
> > ee_len or ee_start are returned by, e.g., the block allocator ?
> > If so, can we try to instrument the code to get suitable traces ?
> > Just to see or to exclude that the corrupted inode is really written
> > to the eMMC ?
> 
> From your description it does sound possible that it's a kernel bug.
> Adding testcases to the code to catch it before it hits the journal
> might be helpful - but then maybe this is something getting overwritten
> after the fact - hard to say.
> 
> Can you share more details of the test you are running?  Or maybe even
> the test itself?

Yes, for sure, we can. Weller, please provide additional details
or corrections. 

In short:
We use an automated cyclic test that writes many small files (a few
kBytes each) into a separate test partition. Each file carries a CRC
checksum for easy consistency checking, plus meta information such as
filename, sequence number and a random number. This lets us tell, from
block device image dumps, whether we are looking at a fragment of an
old, deleted file or at a still valid one.
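
For illustration only, a minimal Python sketch of what such a
self-describing test file could look like. This is not the actual test
code: the MAGIC marker, the field order and widths, the choice of CRC32,
and the names make_record/check_record are all assumptions.

```python
import os
import struct
import zlib

MAGIC = b"TSTF"  # hypothetical marker, helps locating records in raw image dumps


def make_record(filename: str, seq: int, payload: bytes) -> bytes:
    """Build one test file: header (meta information), payload, trailing CRC32."""
    name = filename.encode("utf-8")
    nonce = int.from_bytes(os.urandom(8), "little")  # random number per file
    # Assumed layout: name length, sequence number, random number, payload length
    header = MAGIC + struct.pack("<HQQI", len(name), seq, nonce, len(payload)) + name
    body = header + payload
    return body + struct.pack("<I", zlib.crc32(body))


def check_record(data: bytes) -> bool:
    """Return True if the record starts with MAGIC and its trailing CRC32 matches."""
    if len(data) < len(MAGIC) + 4 or not data.startswith(MAGIC):
        return False
    body = data[:-4]
    (stored_crc,) = struct.unpack("<I", data[-4:])
    return zlib.crc32(body) == stored_crc
```

Because filename, sequence number and random number sit inside the
checksummed region, a stale fragment found in an image dump can be told
apart from the current file of the same name.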

Each test loop looks like this:
1) Boot the device after power-on or reset
2) Run fsck -n BEFORE mounting
2 a) (optional) Take a binary dump of the journal
3) Mount the test partition
4) Check the file content of all files from the previous loop
5) Erase all files from the previous loop
6) Start writing hundreds/thousands of test files
    in multiple directories with several threads
7) After a random time, cut the power or do a soft reset

If 2), 3), 4) or 5) fails, stop test.
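
The stop condition for steps 2) to 5) can be sketched roughly as follows.
This is only an assumed driver, not the real test harness: the function
names, the injectable `run` callable, and the verify/erase callbacks are
hypothetical, and the optional journal dump (2 a) is left out.

```python
import subprocess
from typing import Callable, Optional, Sequence


def run_cmd(argv: Sequence[str]) -> int:
    """Run a command and return its exit status."""
    return subprocess.run(list(argv)).returncode


def pre_write_phase(
    device: str,
    mountpoint: str,
    verify_files: Callable[[str], bool],  # step 4, hypothetical callback
    erase_files: Callable[[str], bool],   # step 5, hypothetical callback
    run: Callable[[Sequence[str]], int] = run_cmd,
) -> Optional[str]:
    """Run steps 2) to 5); return the label of the first failing step, or None."""
    # Step 2: read-only check before mounting; non-zero exit means trouble
    if run(["fsck", "-n", device]) != 0:
        return "2"
    # Step 3: mount the test partition
    if run(["mount", device, mountpoint]) != 0:
        return "3"
    # Step 4: content check of all files from the previous loop
    if not verify_files(mountpoint):
        return "4"
    # Step 5: erase the files from the previous loop
    if not erase_files(mountpoint):
        return "5"
    return None
```

A caller would stop the whole test as soon as this returns anything other
than None, and only otherwise go on to steps 6) and 7).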

We usually run the test with a kind of transaction-safe handling,
i.e. using fsync/rename, to avoid zero-length files or file
fragments.
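
As an illustration of the fsync/rename pattern meant here, a minimal
Python sketch. The function name atomic_write, the ".tmp" suffix and the
additional fsync of the parent directory are assumptions, not necessarily
what the test itself does.

```python
import os


def atomic_write(path: str, data: bytes) -> None:
    """Write data to a temp file, fsync it, then rename it over the target."""
    tmp = path + ".tmp"  # assumed naming scheme for the temporary file
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may write fewer bytes than requested
            view = view[written:]
        os.fsync(fd)  # file content is on stable storage before the rename
    finally:
        os.close(fd)
    os.rename(tmp, path)  # atomically replaces an existing target on POSIX
    # Also sync the directory so the rename itself survives a power cut
    dfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

After a power cut, the target name should then refer either to the
complete old content or to the complete new content, not to an empty or
partially written file.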

> 
> I've used a test framework in the past to simulate resets w/o needing
> to reset the box, and do many journal replays very quickly.  It'd be
> interesting to run it using your testcase.
> 
> Thanks,
> -Eric

Mit freundlichen Grüßen / Best regards

Dirk Juergens

Robert Bosch Car Multimedia GmbH