On Tue, Dec 04, 2012 at 09:54:05PM +0800, Li Zefan wrote:
>
> I've collected some logs in different machines, and the error was always
> triggered in ext3_readdir:
>
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #6685458: rec_len is smaller than minimal - offset=3860, inode=0, rec_len=0, name_len=0
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #9650541: rec_len is smaller than minimal - offset=3960, inode=0, rec_len=0, name_len=0
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #11124783: rec_len is smaller than minimal - offset=4072, inode=0, rec_len=0, name_len=0
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4024, inode=0, rec_len=0, name_len=0
> EXT3-fs error (device sda7): ext3_readdir: bad entry in directory #52740880: rec_len is smaller than minimal - offset=4084, inode=0, rec_len=0, name_len=0

This looks like the last part of the inode was zapped.  It might be
worth adding a kernel patch which dumps out the entire directory block
as a hex dump when this triggers, and then comparing it to what you
get if you dump the directory back out after the machine reboots.
That might give you a hint if something is corrupting the directory
block in memory (especially if you set the remount read-only option).

> The last two errors happened on the same machine, and the same inode! One
> happened in 11/22 (I was told they had run fsck later on), and one in 12/01.

If it's always the same inode, you might want to correlate based on
the pathname.  Is there any commonality across multiple machines in
terms of the directory name, and what application(s) might be touching
that directory?

> Yesterday they upgrade apps on ~30 machines, and soon after that 5 machines
> had filesystem corrupted. However they won't stop upgrading other machines!
>
> On the other hand, we can hardly reproduce this bug in the lab.

This is why wise cloud companies have a (figurative) big red button to
stop upgrade rollouts (which are always done slowly and gradually),
and processes which make it relatively easy for engineers to push the
"big red button".  I seem to recall the operations engineer at
Facebook giving a talk where he mentioned this.  :-)

Good luck!  Sorry, the pattern of corruption really doesn't sound
familiar to me...

						- Ted
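
P.S. For the "dump the block and compare" idea above, here is a rough
userspace sketch in Python.  It is not the kernel patch itself.  It
assumes you have already captured the raw bytes of a directory block
twice by whatever means you have (once when the error fires, once
after the reboot), and that the filesystem uses 4 KiB blocks, which
is only a guess based on the offsets in the logs.  The function names
are made up for illustration, and the checks are a simplified
approximation of what ext3_readdir validates, not a copy of the
kernel code.

```python
import struct

BLOCK_SIZE = 4096  # assumption: 4 KiB blocks (the logged offsets go up to 4084)

# On-disk directory entry header, little-endian:
# inode (u32), rec_len (u16), name_len (u8), file_type (u8), then the name.
HEADER = struct.Struct("<IHBB")


def min_rec_len(name_len):
    """Smallest legal rec_len: 8-byte header plus name, rounded up to 4."""
    return (8 + name_len + 3) & ~3


def find_bad_entry(block):
    """Walk one directory block by rec_len.

    Returns (offset, reason) for the first entry that fails a sanity
    check, or None if the walk reaches the end of the block cleanly.
    An entry with inode == 0 but a sane rec_len is just an unused slot
    and is not reported.
    """
    offset = 0
    while offset < len(block):
        if offset + HEADER.size > len(block):
            return offset, "header runs past end of block"
        _inode, rec_len, name_len, _ftype = HEADER.unpack_from(block, offset)
        if rec_len < min_rec_len(1):
            return offset, "rec_len is smaller than minimal"
        if rec_len % 4 != 0:
            return offset, "rec_len % 4 != 0"
        if rec_len < min_rec_len(name_len):
            return offset, "rec_len is too small for name_len"
        if offset + rec_len > len(block):
            return offset, "directory entry across blocks"
        offset += rec_len
    return None


def hexdump(data, start=0):
    """Format bytes as 16-per-line hex, labelled with absolute offsets."""
    lines = []
    for off in range(0, len(data), 16):
        chunk = data[off:off + 16]
        hexpart = " ".join("%02x" % b for b in chunk)
        lines.append("%08x  %s" % (start + off, hexpart))
    return "\n".join(lines)


def diff_ranges(a, b):
    """Return [start, end) byte ranges where two dumps of a block differ."""
    limit = min(len(a), len(b))
    ranges = []
    start = None
    for i in range(limit):
        if a[i] != b[i]:
            if start is None:
                start = i
        elif start is not None:
            ranges.append((start, i))
            start = None
    if start is not None:
        ranges.append((start, limit))
    return ranges
```

If find_bad_entry() flags the in-memory dump but the copy read back
after reboot walks cleanly, diff_ranges() tells you which bytes
changed, and hexdump() on that region shows whether it is plain
zeroes or something with a recognizable pattern.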