Re: help about ext3 read-only issue on ext3(2.6.16.30)

qixuan wu <wuqixuan@xxxxxxxxx> · Wed, 5 Dec 2012 23:51:07 +0800

On Wed, Dec 5, 2012 at 10:26 PM, Tao Ma <tm@xxxxxx> wrote:
> On 12/05/2012 06:43 PM, Li Zefan wrote:
>> On 2012/12/4 23:09, Theodore Ts'o wrote:
>>> On Tue, Dec 04, 2012 at 09:54:05PM +0800, Li Zefan wrote:
>>>>
>>>> I've collected some logs in different machines, and the error was always
>>>> triggered in ext3_readdir:
>>>>
>>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #6685458: rec_len is smaller than minimal - offset=3860, inode=0, rec_len=0, name_len=0
>>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #9650541: rec_len is smaller than minimal - offset=3960, inode=0, rec_len=0, name_len=0
>>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #11124783: rec_len is smaller than minimal - offset=4072, inode=0, rec_len=0, name_len=0
>>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
>>>> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0
>>>
>>> This looks like the last part of the inode was zapped.  It might be
>>
>> I don't think so. See below...
>>
>>> worth adding a kernel patch which dumps out the entire directory block
>>> as a hex dump when this triggers --- and then compare it to what you
>>> get if you dump the directory back out after the machine reboot.  That
>>> might give you a hint if something is corrupting the directory block
>>> in memory.  (especially if you set the remount read-only option).
>>>
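
For comparing the block after a reboot, a small user-space sketch may help.
It is not the kernel code, only a rough imitation of the kind of sanity
checks ext3 applies to each entry, assuming the ext3_dir_entry_2 layout
(le32 inode, le16 rec_len, u8 name_len, u8 file_type, then the name). The
function names check_dir_block and hexdump are made up for this example,
and how the raw block bytes are obtained is left open.

```python
import struct


def min_rec_len(name_len):
    # 8-byte header plus the name, rounded up to a multiple of 4.
    return (name_len + 8 + 3) & ~3


def check_dir_block(block):
    """Walk a raw directory block.

    Returns (offset, reason) for the first bad entry, or None if every
    entry passes these (simplified) checks.
    """
    offset = 0
    while offset < len(block):
        if offset + 8 > len(block):
            return (offset, "entry header crosses block end")
        inode, rec_len, name_len, _ftype = struct.unpack_from(
            "<IHBB", block, offset)
        if rec_len < min_rec_len(1):
            return (offset, "rec_len is smaller than minimal")
        if rec_len % 4 != 0:
            return (offset, "rec_len % 4 != 0")
        if rec_len < min_rec_len(name_len):
            return (offset, "rec_len is too small for name_len")
        if offset + rec_len > len(block):
            return (offset, "directory entry across blocks")
        offset += rec_len
    return None


def hexdump(block, width=16):
    # Plain hex dump, one line per `width` bytes, prefixed with the offset.
    lines = []
    for off in range(0, len(block), width):
        chunk = block[off:off + width]
        lines.append("%04x  %s" % (off, " ".join("%02x" % b for b in chunk)))
    return "\n".join(lines)
```

An all-zero region read as an entry gives inode=0, rec_len=0, name_len=0,
which is what the logged errors show.
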
>>>> The last two errors happened on the same machine, and the same inode! One
>>>> happened in 11/22 (I was told they had run fsck later on), and one in 12/01.
>>>
>>> If it's always the same inode, you might want to correlate based on
>>> the pathname.  Is there any commonality across multiple machines in
>>> terms of the directory name, and what application(s) might be touching
>>> that directory?
>>>
>>
>> I found this in one log:
>>
>> Nov 14 05:26:55 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=3952, inode=0, rec_len=0, name_len=0
>> Nov 14 13:42:40 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
>> Nov 16 17:29:40 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0
>> Nov 23 19:42:44 kernel: EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #7225391: rec_len is smaller than minimal - offset=3952, inode=0, rec_len=0, name_len=0
>>
>> Happened 4 times, on the same inode, at different offsets. Another log showed
>> the same pattern.
>>
>> They said they ran fsck every time this happened. Many machines hit this problem,
>> but they remember that most of the time fsck didn't report any errors. (*)
>>
>> I've checked the pathnames, and they all point to log dirs. There are 2 kinds
>> of log dirs with different loggers, but they seem to work similarly.
>>
>> Except for one bug report, all the others point to exactly the same log dir.
>>
>> There are two processes that touch this dir. One is a monitor; it will
>> delete old logs if they occupy too much space, but normally this shouldn't
>> happen.
>>
>> The other is the logger. When it wants to log something, it scans the
>> directory, and if there are more than 100 log files, it deletes the oldest
>> one. After writing to the current log file, if the file is larger than 8M,
>> it is renamed as a backup log. I haven't read the code yet. But it sounds
>> pretty simple, right?
>>
>> The length of the file name is 25. There were 35 logs dating from 2012/11/02
>> to 2012/11/23, and no pending deleted files. Thus the remaining ~2.8K of the
>> dir block is never used, so I don't think something zeroed it because it
>> has always been zero.
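
To put numbers on the ~2.8K estimate, here is a rough sketch of the
arithmetic. It assumes a 4K block size (the offsets in the logs go up to
4084) and the usual ext2/ext3 entry layout of an 8-byte header plus the
name, rounded up to a multiple of 4. The helper name dir_rec_len is made up
for this example.

```python
def dir_rec_len(name_len):
    # 8-byte entry header plus the name, rounded up to a multiple of 4.
    return (name_len + 8 + 3) & ~3


BLOCK_SIZE = 4096   # assumption: 4K blocks
NUM_LOGS = 35       # from the report: 35 log files
NAME_LEN = 25       # from the report: file names are 25 characters long

# "." and ".." plus the 35 log file entries.
used = dir_rec_len(1) + dir_rec_len(2) + NUM_LOGS * dir_rec_len(NAME_LEN)
unused = BLOCK_SIZE - used

print(used, unused)  # prints "1284 2812"
```

So about 1284 bytes hold entries and about 2812 bytes are left over, which
agrees with the estimate. On disk the last entry's rec_len is normally
stretched to the end of the block, so that unused space is slack inside the
last entry.
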
> Only 35 files? So there should be no rename. And the only possible
> action we do to this dir is "create a new log file", right? Then, I
> really don't think ext3 will error in such a simple test case. :(
>
>>
>> This log dir is new in this version, while the other one also exists in
>> the old version, with less IO.
> You mean the kernel version? Sorry, but what do you want to tell us here?

He means the user-space app version. The new version of the user-space app
uses this file op model, and that is when the problem started to appear.
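
To make the file op model concrete, here is a minimal sketch of the logger
behaviour as Li Zefan described it (scan the directory, delete the oldest
log when there are more than 100, rename the current log when it exceeds
8M). This is not the real logger code, which has not been posted; the
function name write_log and the file names are hypothetical.

```python
import os
import time

MAX_LOGS = 100               # from the description: more than 100 files
MAX_SIZE = 8 * 1024 * 1024   # from the description: larger than 8M
CURRENT = "current.log"      # hypothetical name of the active log file


def write_log(log_dir, line, max_logs=MAX_LOGS, max_size=MAX_SIZE):
    # 1. Scan the directory; this is the step that ends up in readdir.
    names = sorted(
        os.listdir(log_dir),
        key=lambda n: os.path.getmtime(os.path.join(log_dir, n)))

    # 2. If there are too many log files, delete the oldest one.
    if len(names) > max_logs:
        os.unlink(os.path.join(log_dir, names[0]))

    # 3. Append the message to the current log file.
    path = os.path.join(log_dir, CURRENT)
    with open(path, "a") as f:
        f.write(line + "\n")

    # 4. If the file grew past the limit, rename it as a backup log.
    if os.path.getsize(path) > max_size:
        backup = "backup-%d.log" % time.time_ns()
        os.rename(path, os.path.join(log_dir, backup))
```

With only 35 files in the directory, step 2 never runs, so the directory is
changed only when a new file is created in step 3 or renamed in step 4.
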

Thanks
wuqixuan

> Thanks
> Tao
>>
>> (*) They have machines at different sites. At another site, 5 out of ~30
>> machines hit this problem after upgrading, and fsck reported errors on
>> all of them. However there were just a few errors, and they didn't seem to
>> relate to the directory, which means the directory seems intact. Given
>> that the fs was created nearly 1 year ago and ever fscked, those errors
>> might have nothing to do with this bug?
>>
>> btw, the version of e2fsprogs is: e2fsck 1.38 (30-Jun-2005)
>>
>> Regards
>> Li Zefan
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>