Re: ext4 inode corruption

Curt Wohlgemuth <curtw@xxxxxxxxxx> · Wed, 23 Sep 2009 15:50:53 -0700

Sorry to reply to self, but I'm now pretty sure that I understand this
problem.  (Of course this insight came mere hours after I sent this
email -- and not in the previous 4 days of staring at it.)

It's likely the same issue fixed by

       commit 1b774f669b4b02f4d2abf2792362ab72a2e124ab
       ext4: Use bforget() in no journal mode for ext4_journal_{forget,revoke}()

In the case that commit fixes, an about-to-be-freed metadata block in
no-journal mode is left marked dirty and eligible for writeback.  The
block is then freed and re-used as a data block for a different inode;
when the stale writeback finally takes place, it corrupts that data block.

In this case, the newly-freed block is re-used as a *metadata* block
for a different inode.  Hence the same pattern we were seeing before:
eh_entries = 0, eh_max = 340.

These inodes were left behind on disk by kernels that lacked the above
patch.  Accessing the files on *patched* kernels still makes the BUG
fire, hence the confusion.

Thanks,
Curt

On Wed, Sep 23, 2009 at 9:27 AM, Curt Wohlgemuth <curtw@xxxxxxxxxx> wrote:
> We've been seeing sporadic inode corruption on our ext4 partitions which
> we've been trying to analyze, without much success.  I'm wondering if
> anybody might have some clues as to where things might be going wrong.
>
> We find out about the corruption via a BUG firing in ext4_ext_get_blocks():
>
>        /*
>         * consistent leaf must not be empty;
>         * this situation is possible, though, _during_ tree modification;
>         * this is why assert can't be put in ext4_ext_find_extent()
>         */
>        BUG_ON(path[depth].p_ext == NULL && depth != 0);
>
> Of course, this fires long after the inode in question is corrupted.  With
> some diagnostics added in front of this bug, we can find the inodes; they
> all have characteristics like this:
>
> Output from debugfs' stat command:
>
>   Inode: 1195575   Type: regular    Mode:  0600   Flags: 0x80000
>   Generation: 2821101782    Version: 0x00000001
>   User: 35800   Group:  5000   Size: 8400896
>   File ACL: 0    Directory ACL: 0
>   Links: 1   Blockcount: 8
>   Fragment:  Address: 0    Number: 0    Size: 0
>   ctime: 0x4a9f8009 -- Thu Sep  3 01:36:25 2009
>   atime: 0x4a9f7ff7 -- Thu Sep  3 01:36:07 2009
>   mtime: 0x4a9f8009 -- Thu Sep  3 01:36:25 2009
>   EXTENTS:
>
> Note that no data blocks are printed out here.
>
> Following the actual extent tree, it always looks like this:
>
>   in-inode extent header:
>     eh_magic: 0xf30a
>     eh_entries: 1
>     eh_max: 4
>     eh_depth: 1
>
>   in-inode extent index 0:
>     ei_block: 0
>     ei_leaf_lo: 36738577
>     ei_leaf_hi: 0
>
>      leaf node header (at block 36738577):
>        eh_magic: 0xf30a
>        eh_entries: 0
>        eh_max: 340
>        eh_depth: 0
>
> The i_size value of the inode will vary, from 8192 to 8400896.  But the
> i_blocks value is *always* 8.
>
> The extent tree always has depth of 1 in the in-inode header, and a valid
> leaf node header; but the leaf node header always has 0 entries.  This is
> what's causing the BUG above to fire.
>
> We believe the general pattern of user space calls to create these files is
> something like this:
>
>   open(O_DIRECT)
>   fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 8400896)
>   < various writes to the file >
>   fallocate(fd, 0, 0, actual_size + BLOCK_SIZE)
>   ftruncate(fd, actual_size)
>
> The second fallocate() call without KEEP_SIZE allows the following
> ftruncate to actually truncate the file -- a known issue recently fixed by
> Jiaying Zhang (but her fix is not in our kernel yet).  "actual_size" can be
> 0 at times.
>
> I can't think of any sequence of actions that would leave i_size so large
> while i_blocks is always 8.  Looking at the code in
>
>   ext4_ext_remove_space()
>   ext4_ext_rm_leaf()
>   ext4_ext_rm_idx()
>
> I don't see a way for the extent tree to take the shape above.  There are no
> errors that I can see around the time the corrupted inodes are created.  It
> *seems* as though the corruption is coming during truncation, but all our
> efforts to reproduce this with small test cases have so far failed.
>
> We're using a 2.6.26 code base, with most of the latest ext4 patches
> applied.
>
> Any insights/ruminations/guesses as to what might be happening are welcome.
>
> Thanks,
> Curt
>