Re: ext4: journal has aborted

Matteo Croce <technoboy85@xxxxxxxxx> · Wed, 2 Jul 2014 12:19:35 +0200

2014-07-02 12:17 GMT+02:00 David Jander <david@xxxxxxxxxxx>:
>
> Hi Eric,
>
> On Tue, 1 Jul 2014 12:36:46 -0400
> Eric Whitney <enwlinux@xxxxxxxxx> wrote:
>
>> * Theodore Ts'o <tytso@xxxxxxx>:
>> > On Tue, Jul 01, 2014 at 09:07:27PM +0900, Jaehoon Chung wrote:
>> > > Hi,
>> > >
>> > > I am interested in this problem, because I ran into the same thing.
>> > > Is it a journal problem?
>> > >
>> > > I used the Linux version 3.16.0-rc3.
>> > >
>> > > [    3.866449] EXT4-fs error (device mmcblk0p13): ext4_mb_generate_buddy:756: group 0, 20490 clusters in bitmap, 20488 in gd; block bitmap corrupt.
>> > > [    3.877937] Aborting journal on device mmcblk0p13-8.
>> > > [    3.885025] Kernel panic - not syncing: EXT4-fs (device mmcblk0p13): panic forced after error
>> >
>> > This message means that the file system has detected an inconsistency
>> > --- specifically, that the number of blocks marked as in use in the
>> > allocation bitmap is different from what is in the block group
>> > descriptors.
>> >
>> > The file system has been marked to force a panic after an error, at
>> > which point e2fsck will be able to repair the inconsistency.
>> >
>> > What's not clear is *how* or why this happened.  It can happen simply
>> > because of a hardware problem.  (In particular, not all mmc flash
>> > devices handle power failures gracefully.)  Or it could be a cosmic
>> > ray, or it might be a kernel bug.
>> >
>> > Normally I would chalk this up to a hardware bug, but it's possible
>> > that it is a kernel bug.  If people can reliably reproduce the problem
>> > where no power failures or other unclean shutdowns were involved
>> > (since the last time the file system was checked using e2fsck) then
>> > that would be really interesting.
>>
>> Hi Ted:
>>
>> I saw a similar failure during 3.16-rc3 (plus ext4 stable fixes plus msync
>> patch) regression on the Pandaboard this morning.  A generic/068 hang
>> on data_journal required a reboot for recovery (old bug, though rarer lately).
>> On reboot, the root filesystem - default 4K, and on an SD card - went ro
>> after the same sort of bad block bitmap / journal abort sequence.  Rebooting
>> forced a fsck that cleared up the problem.  The target test filesystem was on
>> a USB-attached disk, and it did not exhibit the same problems on recovery.
>
> Please be careful about conclusions from regular SD cards and USB sticks for
> mass-storage. Unlike hardened eMMC (4.41+), these COTS mass-storage devices
> are not meant for intensive use and can quite easily corrupt data all by
> themselves. I've seen it happen many times already.
>
>> So, it looks like there might be more than just hardware involved here,
>> although eMMC/flash might be a common denominator.  I'll see if I can come up
>> with a reliable reproducer once the regression pass is finished if someone
>> doesn't beat me to it.
>
> I agree that there is a strong correlation towards flash-based storage, but I
> cannot explain why this factor would make a difference. How are flash-based
> block-devices different to ext4 than spinning-disk media (besides trim
> support)?

Maybe the near-zero access time can trigger some race condition?
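
For anyone following along, the consistency check behind the quoted
ext4_mb_generate_buddy error can be illustrated roughly as below. This is
only a sketch in Python, not the kernel code: the function names are made
up, the 32768-cluster group size is an assumption (4K blocks, no bigalloc),
and I am assuming the two numbers in the message are free-cluster counts.
The idea is simply to count the bits in the group's block bitmap and compare
the result with the count stored in the group descriptor.

```python
# Illustrative sketch only; names and sizes are assumptions, not kernel code.
CLUSTERS_PER_GROUP = 32768  # assumed: 4K block size, one block per cluster


def make_bitmap(used_clusters: int, clusters_in_group: int) -> bytes:
    """Build a bitmap with the first `used_clusters` bits set (1 = in use)."""
    bitmap = bytearray((clusters_in_group + 7) // 8)
    for i in range(used_clusters):
        bitmap[i // 8] |= 1 << (i % 8)  # little-endian bit order within a byte
    return bytes(bitmap)


def count_free_clusters(bitmap: bytes, clusters_in_group: int) -> int:
    """Count the zero bits (free clusters) among the group's bits."""
    free = 0
    for i in range(clusters_in_group):
        if not (bitmap[i // 8] >> (i % 8)) & 1:
            free += 1
    return free


def check_group(bitmap: bytes, clusters_in_group: int, free_in_gd: int):
    """Return (consistent, free_in_bitmap) for one block group."""
    free_in_bitmap = count_free_clusters(bitmap, clusters_in_group)
    return free_in_bitmap == free_in_gd, free_in_bitmap


# Reproduce the mismatch from the log: 20490 in the bitmap, 20488 in the gd.
bitmap = make_bitmap(CLUSTERS_PER_GROUP - 20490, CLUSTERS_PER_GROUP)
consistent, found = check_group(bitmap, CLUSTERS_PER_GROUP, 20488)
if not consistent:
    print(f"group 0, {found} clusters in bitmap, 20488 in gd")
```

A difference of two clusters like this says that the two on-disk copies of
the accounting disagree, but not which one is stale, which is why it cannot
by itself distinguish a hardware problem from a kernel bug.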

> Best regards,
>
> --
> David Jander
> Protonic Holland.

-- 
Matteo Croce
OpenWrt Developer