Re: ext4 inode corruption

Andreas Dilger <adilger@xxxxxxx> · Thu, 24 Sep 2009 12:27:49 -0600

On Sep 23, 2009  15:50 -0700, Curt Wohlgemuth wrote:
> Sorry to reply to self, but I'm now pretty sure that I understand this
> problem.  (Of course this insight came mere hours after I sent this
> email -- and not in the previous 4 days of staring at it.)
> 
> It's likely the same issue fixed by
> 
>        commit 1b774f669b4b02f4d2abf2792362ab72a2e124ab
>        ext4: Use bforget() in no journal mode for ext4_journal_{forget,revoke}()

I was going to say that this sounded like a familiar problem, but you
already did the leg (well, mouse) work.

> In the previous case, in no-journal mode an about-to-be-freed metadata
> block is marked dirty and available for writeback.  The block is then
> marked free, and re-used as a data block for a different inode; the
> writeback takes place, corrupting the data block.
> 
> In this case, the newly-freed block is re-used as a *metadata* block
> for a different inode.  Hence the same pattern we were seeing before:
> eh_entries = 0, eh_max = 340.
> 
> These inodes were left on systems from kernels without the above
> patch.  Accessing the files on *patched* kernels will still make the
> BUG fire, hence the confusion.
> 
> Thanks,
> Curt
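
For anyone reading this later, the sequence Curt describes for the earlier
(data block) case can be sketched as a toy model.  This is only an
illustration, not kernel code: the "disk", the one-buffer-per-block cache,
and every function name below are hypothetical, and the forget flag merely
stands in for the role bforget() plays in the fix.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NBLOCKS 4
#define BLKSZ   8

/* Toy "disk" and a one-buffer-per-block cache; all names are hypothetical. */
static char disk[NBLOCKS][BLKSZ];
static char cache[NBLOCKS][BLKSZ];
static bool dirty[NBLOCKS];

/* Modify a block through the cache and leave it dirty for later writeback. */
static void cache_write(int blk, const char *data)
{
    snprintf(cache[blk], BLKSZ, "%s", data);
    dirty[blk] = true;
}

/* Free a block.  With forget == true the dirty buffer is discarded,
 * which is the part a bforget()-style call contributes. */
static void free_block(int blk, bool forget)
{
    if (forget)
        dirty[blk] = false;
}

/* The new owner writes its contents without going through the old buffer. */
static void direct_write(int blk, const char *data)
{
    snprintf(disk[blk], BLKSZ, "%s", data);
}

/* Background writeback: flush every buffer still marked dirty. */
static void writeback(void)
{
    for (int i = 0; i < NBLOCKS; i++) {
        if (dirty[i]) {
            memcpy(disk[i], cache[i], BLKSZ);
            dirty[i] = false;
        }
    }
}

/* Run the free/reuse/writeback sequence; return true if the new owner's
 * contents are what ends up on disk. */
static bool reuse_survives(bool forget)
{
    memset(disk, 0, sizeof(disk));
    memset(cache, 0, sizeof(cache));
    memset(dirty, 0, sizeof(dirty));

    cache_write(1, "oldmeta");   /* inode A dirties a metadata block */
    free_block(1, forget);       /* inode A frees it                 */
    direct_write(1, "newdata");  /* inode B reuses the block         */
    writeback();                 /* a stale buffer may hit the disk  */

    return strcmp(disk[1], "newdata") == 0;
}
```

Calling reuse_survives(false) models the unpatched behaviour, where the
stale buffer overwrites the new owner's block; reuse_survives(true) models
dropping the buffer when the block is freed.
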
> 
> 
> On Wed, Sep 23, 2009 at 9:27 AM, Curt Wohlgemuth <curtw@xxxxxxxxxx> wrote:
> > We've been seeing sporadic inode corruption on our ext4 partitions which
> > we've been trying to analyze, without much success.  I'm wondering if
> > anybody might have some clues as to where things might be going wrong.
> >
> > We find out about the corruption via a BUG firing in ext4_ext_get_blocks():
> >
> >        /*
> >         * consistent leaf must not be empty;
> >         * this situation is possible, though, _during_ tree modification;
> >         * this is why assert can't be put in ext4_ext_find_extent()
> >         */
> >        BUG_ON(path[depth].p_ext == NULL && depth != 0);
> >
> > Of course, this fires long after the inode in question is corrupted.  With
> > some diagnostics added in front of this bug, we can find the inodes; they
> > all have characteristics like this:
> >
> > Output from debugfs' stat command:
> >
> >   Inode: 1195575   Type: regular    Mode:  0600   Flags: 0x80000
> >   Generation: 2821101782    Version: 0x00000001
> >   User: 35800   Group:  5000   Size: 8400896
> >   File ACL: 0    Directory ACL: 0
> >   Links: 1   Blockcount: 8
> >   Fragment:  Address: 0    Number: 0    Size: 0
> >   ctime: 0x4a9f8009 -- Thu Sep  3 01:36:25 2009
> >   atime: 0x4a9f7ff7 -- Thu Sep  3 01:36:07 2009
> >   mtime: 0x4a9f8009 -- Thu Sep  3 01:36:25 2009
> >   EXTENTS:
> >
> > Note that no data blocks are printed out here.
> >
> > Following the actual extent tree, it always looks like this:
> >
> >   in-inode extent header:
> >     eh_magic: 0xf30a
> >     eh_entries: 1
> >     eh_max: 4
> >     eh_depth: 1
> >
> >   in-inode extent index 0:
> >     ei_block: 0
> >     ei_leaf_lo: 36738577
> >     ei_leaf_hi: 0
> >
> >      leaf node header (at block 36738577):
> >        eh_magic: 0xf30a
> >        eh_entries: 0
> >        eh_max: 340
> >        eh_depth: 0
> >
> > The i_size value of the inode will vary, from 8192 to 8400896.  But the
> > i_blocks value is *always* 8.
> >
> > The extent tree always has depth of 1 in the in-inode header, and a valid
> > leaf node header; but the leaf node header always has 0 entries.  This is
> > what's causing the BUG above to fire.
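
The numbers in the dumps above can be checked with a small sketch.  It
assumes the usual ext4 on-disk layout (a 12-byte extent header followed by
12-byte entries, 60 bytes of i_block space in the inode) and a 4096-byte
block size, which is not stated in the report.  The struct and helper names
are made up for illustration and are not the kernel's.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EXT_MAGIC      0xf30a  /* eh_magic value shown in the dumps */
#define EXT_HDR_BYTES  12      /* on-disk extent header size */
#define EXT_ENT_BYTES  12      /* on-disk extent / index entry size */

/* Host-endian copy of the four header fields quoted in the report. */
struct ext_hdr {
    uint16_t eh_magic;
    uint16_t eh_entries;
    uint16_t eh_max;
    uint16_t eh_depth;
};

/* Decode a little-endian 16-bit field from a raw block buffer. */
static uint16_t le16_at(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Pull the header fields out of the first 8 bytes of a tree node. */
static struct ext_hdr parse_ext_hdr(const uint8_t *buf)
{
    struct ext_hdr h = {
        .eh_magic   = le16_at(buf + 0),
        .eh_entries = le16_at(buf + 2),
        .eh_max     = le16_at(buf + 4),
        .eh_depth   = le16_at(buf + 6),
    };
    return h;
}

/* Number of entries that fit in a node of `bytes` bytes after the header. */
static unsigned ext_capacity(size_t bytes)
{
    return (unsigned)((bytes - EXT_HDR_BYTES) / EXT_ENT_BYTES);
}

/* True for the pattern in the report: a structurally valid leaf that is
 * reached through an index (parent depth > 0) but holds no extents. */
static bool is_empty_nonroot_leaf(const struct ext_hdr *h, unsigned parent_depth)
{
    return h->eh_magic == EXT_MAGIC && h->eh_depth == 0 &&
           h->eh_entries == 0 && parent_depth > 0;
}
```

Under those assumptions ext_capacity(60) gives the in-inode eh_max of 4 and
ext_capacity(4096) gives the leaf eh_max of 340, so the leaf header looks
freshly initialized rather than random garbage.  If i_blocks is counted in
512-byte units, a value of 8 would also be consistent with only that one
leaf block being charged to the inode.
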
> >
> > We believe the general pattern of user space calls to create these files is
> > something like this:
> >
> >   open(O_DIRECT)
> >   fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 8400896)
> >   < various writes to the file >
> >   fallocate(fd, 0, 0, actual_size + BLOCK_SIZE)
> >   ftruncate(fd, actual_size)
> >
> > The second fallocate() call without KEEP_SIZE allows the following
> > ftruncate to actually truncate the file -- a known issue recently fixed by
> > Jiaying Zhang (but her fix is not in our kernel yet).  "actual_size" can be
> > 0 at times.
> >
> > I can't think of any actions that would cause the i_size to be so large, yet
> > leave the i_blocks always at 8.  Looking at the code in
> >
> >   ext4_ext_remove_space()
> >   ext4_ext_rm_leaf()
> >   ext4_ext_rm_idx()
> >
> > I don't see a way for the extent tree to take the shape above.  There are no
> > errors that I can see around the time the corrupted inodes are created.  It
> > *seems* as though the corruption is coming during truncation, but all our
> > efforts to reproduce this with small test cases have so far failed.
> >
> > We're using a 2.6.26 code base, with most of the latest ext4 patches
> > applied.
> >
> > Any insights/ruminations/guesses as to what might be happening are welcome.
> >
> > Thanks,
> > Curt
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
