Re: ext4 filesystem bad extent error review

"Juergens Dirk (CM-AI/ECO2)" <Dirk.Juergens@xxxxxxxxxxxx> · Fri, 3 Jan 2014 19:56:45 +0100

On Fri, Jan 03, 2014 at 19:49, Eric Sandeen wrote:
> 
> On 1/3/14, 12:45 PM, Juergens Dirk (CM-AI/ECO2) wrote:
> >
> > On Fri, Jan 03, 2014 at 19:24, Eric Sandeen wrote:
> >>
> >> On 1/3/14, 10:29 AM, Juergens Dirk (CM-AI/ECO2) wrote:
> >>> So, I think there _might_ be a kernel bug, but it could be also a
> >>> problem related to the particular type of eMMC. We did not observe
> >>> the same issue in previous tests with another type of eMMC from
> >>> another supplier, but this was with an older kernel patch level and
> >>> with another HW design.
> >>>
> >>> Regarding a possible kernel bug: Is there any chance that the invalid
> >>> ee_len or ee_start are returned by, e.g., the block allocator?
> >>> If so, can we try to instrument the code to get suitable traces?
> >>> Just to see or to exclude that the corrupted inode is really written
> >>> to the eMMC?
> >>
> >> From your description it does sound possible that it's a kernel bug.
> >> Adding testcases to the code to catch it before it hits the journal
> >> might be helpful - but then maybe this is something getting overwritten
> >> after the fact - hard to say.
> >>
> >> Can you share more details of the test you are running?  Or maybe even
> >> the test itself?
> >
> > Yes, for sure, we can. Weller, please provide additional details
> > or corrections.
> >
> > In short:
> > Basically we use an automated cyclic test writing many small
> > (some kBytes) files with CRC checksums for easy consistency check
> > into a separate test partition. Files also contain meta information
> > like filename, sequence number and a random number to allow to identify
> > from block device image dumps, if we just see a fragment of an old
> > deleted file or a still valid one.
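
To make the file layout described above more concrete, here is a minimal sketch in Python of a self-describing test record. It is not the actual test tool: the marker `TF01`, the field order, and the names `build_record` and `parse_record` are all assumptions made for illustration. The idea is only that each file carries its filename, a sequence number, and a random number, with a CRC32 over everything, so a block found in a raw image dump can be matched against the expected generation.

```python
import os
import struct
import zlib

MAGIC = b"TF01"          # hypothetical marker, eases searching raw image dumps
FIXED = ">HQQI"          # name length, sequence number, random number, payload length
HEADER_LEN = len(MAGIC) + struct.calcsize(FIXED)


def build_record(filename: str, seq: int, payload: bytes) -> bytes:
    """Serialize one test file: header, filename, payload, trailing CRC32."""
    name = filename.encode("utf-8")
    nonce = int.from_bytes(os.urandom(8), "big")
    body = MAGIC + struct.pack(FIXED, len(name), seq, nonce, len(payload)) + name + payload
    return body + struct.pack(">I", zlib.crc32(body))


def parse_record(blob: bytes):
    """Return (filename, seq, nonce, payload); raise ValueError if inconsistent."""
    if len(blob) < HEADER_LEN + 4 or blob[:len(MAGIC)] != MAGIC:
        raise ValueError("not a test record")
    name_len, seq, nonce, payload_len = struct.unpack(FIXED, blob[len(MAGIC):HEADER_LEN])
    end = HEADER_LEN + name_len + payload_len
    if len(blob) != end + 4:
        raise ValueError("truncated or oversized record")
    (stored_crc,) = struct.unpack(">I", blob[end:])
    if zlib.crc32(blob[:end]) != stored_crc:
        raise ValueError("CRC mismatch")
    name = blob[HEADER_LEN:HEADER_LEN + name_len].decode("utf-8")
    payload = blob[HEADER_LEN + name_len:end]
    return name, seq, nonce, payload
```

A checker would call `parse_record` on every file from the previous loop and compare the returned filename and sequence number with what it expected to find.
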
> >
> > Each test loop looks like this:
> 
> 0) mkfs the filesystem - with what options?  How big?

Here we need the details from Weller, because
he set all of this up.

> 
> > 1) Boot the device after power on or reset
> > 2) Do fsck -n BEFORE mounting
> > 2 a) (optional) binary dump of the journal
> > 3) Mount test partition
> 
> Again with what options, if any?

Again, Weller will have to provide the details, sorry.

> 
> > 4) File content check for all files from prev. loop
> > 5) erase all files from previous loop
> > 6) start writing hundreds/thousands of test files
> >     in multiple directories with several threads
> 
> I guess this is where we might need more details in order
> to try to recreate the failure, but perhaps
> this is not a case where you can simply share the IO
> generation utility...?

I think we can share the code; please let me check on Monday.

> 
> Thanks,
> -Eric
> 
> > 7) after random time cut the power or do soft reset
> >
> > If 2), 3), 4) or 5) fails, stop test.
> >
> > We usually run the test with a kind of transaction-safe
> > handling, i.e. using fsync/rename, to avoid zero-length files
> > or file fragments.
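
For reference, the fsync/rename handling mentioned above usually follows the pattern sketched below. This is a generic Python illustration, not our test code: the function name `write_file_atomically` and the `.tmp` suffix are assumptions. The data is written to a temporary file, flushed with fsync, renamed over the final name, and then the directory is fsynced so that the rename itself is durable.

```python
import os


def write_file_atomically(path: str, data: bytes) -> None:
    """Write data so that path holds either the old or the complete new content."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # hypothetical naming scheme for the temporary file

    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)   # os.write may write fewer bytes than asked
            view = view[written:]
        os.fsync(fd)                       # flush file data and metadata to the device
    finally:
        os.close(fd)

    os.rename(tmp_path, path)              # atomic replacement on POSIX filesystems

    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)                   # make the new directory entry durable
    finally:
        os.close(dir_fd)
```

With this pattern, a power cut at any point should leave either the previous file or the complete new one under the final name, plus possibly a leftover `.tmp` file that the next loop can delete.
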
> >
> >>
> >> I've used a test framework in the past to simulate resets w/o needing
> >> to reset the box, and do many journal replays very quickly.  It'd be
> >> interesting to run it using your testcase.
> >>
> >> Thanks,
> >> -Eric
> >
> > Best regards
> >
> > Dirk Juergens
> >
> > Robert Bosch Car Multimedia GmbH
> >

Best regards

Dirk Juergens

Robert Bosch Car Multimedia GmbH