Re: ext34_free_inode's mess

Dmitry Monakhov <dmonakhov@xxxxxxxxxx> · Fri, 16 Apr 2010 02:01:35 +0400

Jan Kara <jack@xxxxxxx> writes:

> On Wed 14-04-10 18:33:30, Dmitry Monakhov wrote:
>> Jan Kara <jack@xxxxxxx> writes:
>> 
>> > On Wed 14-04-10 15:19:47, Dmitry Monakhov wrote:
>> >> I've finally automated my favorite testcase (see attachment), 
>> >> before i've run it by hand.
>> >> And sometimes i've saw following complain from fsck:
>> >> fsck.ext4 -f -n /dev/sdb2
>> >> ...
>> >> Pass 5: Checking group summary information
>> >> Inode bitmap differences:  -93582
>> >> Fix? no
>> >> 
>> >> Free inodes count wrong for group #12 (4634, counted=4633).
>> >> Fix? no
>> >> 
>> >> Free inodes count wrong (35610, counted=35609).
>> >> Fix? no
>> >> ...
>> >   Interesting. So some inode is marked as free although it is in
>> > use, right? That sounds like a nasty bug - if you reproduce this
>> > again, could you use debugfs to find out what file type is that
>> > inode? It could help looking for the bug.
>> No problems, 
>> wget http://download.openvz.org/~dmonakhov/junk/sdb2-2.bz2
>> In fact i've had even better image (with only 1 free inode in a
>> group, but full bitmask) unfortunately i forgot to save it.
>   I've looked at it: So the problem is the other way around (I always
> confuse this). The inode is properly deleted but the bit remains set
> in the bitmap. What is strange is that group descriptor counts are
> correct so it's really only the bitmap bit that is wrong. I've looked
> through the inode allocation and freeing code back and forth but I could
> not find a place where this could realistically happen.
>   So just for record:
> Inode has mtime = ctime = atime = dtime (so it was really deleted), i_nlink
> = 0, it is a directory, i_disksize = 4096, i_blocks = 0. So indeed it looks
> that we were in ext4_mkdir, we failed to allocate the block for directory
> and went to out_clear_inode (thus i_disksize remained to be set to 4096,
> otherwise it would be set to 0)... But how it happened that the bit in the
> bitmap didn't get cleared while the group descriptors were updated is
> beyond me.
>   Alternatively the inode could have been deleted just fine and later we
> just set the bit in the inode bitmap and didn't update anything else. But
> even this does not seem to be possible to me...
>   Since you can reproduce it, good first step would be to 
I will, but for now i'm working on fix for OOPS
from fs/ext4/extents.c:3479  due to ex == NULL
I'll create new bug in bugzilla for this in a minute.
>
>> >> I've started to look an inode bitmap manipulation code paths
>> >> and found strange logic in ext{3,4}_free_inode functions
>> >> 
>> >> 1) Group lock acquired twice for bitmap and for group_desc.
>> >>    There are not any advantage from this double locking, only
>> >>    error path(where the bit is already cleared) takes an
>> >>    advantage from this locking schema.
>> >>    It is reasonable to batch it in to one locking block.
>> >   I guess you think that this happens because we pass the lock parameter
>> > to ext3_clear_bit_atomic. But if you would actually look at the definition
>> > of the function, you would see that it's hard to find an architecture that
>> > uses the lock. Most architectures just use atomic bitop to clear the bit.
>> > I actually fail to see why anyone would need the lock - probably Ted knows
>> > :).
>> >
>> >> 2) if we failed to read gdp then bh2 is undefined so
>> >>    may result in oops due to undefince pointer dereferance.
>> >   No, because during mount time we check that all gdp pointers exist so
>> > ext3_get_group_desc can never fail after the mount has succeeded.
>> Yes, that is right,  why we have to check gdp to NULL when?
>   Hmm, I've looked at the code again and I think the check is there mainly
> to avoid Oops in case filesystem got corrupted and we computed some bogus
> group number. Not that I would see how that could happen in this particular
> case but in some other uses of ext3_get_group_desc it could happen. So
> moving the gdp check before we use bh2 probably makes some sence (although
> it's probably just a style cleanup in this case).
Ok, if we know that any error result in EIO or panic when let's just
call it style cleanup(simplification), imho new code is more readable.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html