On Tue, Jun 14, 2005 at 10:26:52PM -0400, David Shaw wrote:
> On Tue, Jun 14, 2005 at 07:19:23PM -0400, Andreas Dilger wrote:
> > On Jun 14, 2005 17:14 -0400, David Shaw wrote:
> > > Jun 13 13:58:16 n202 kernel: EXT3-fs error (device sda5): ext3_get_inode_block: bad inode number: 9
> > >
> > > This particular example is a SATA disk, but it has happened to a
> > > regular old IDE disk as well. It is always the root partition. The
> > > bad inode number varies (but is always either 3 or 9). There are no
> > > other errors about the disk in the log.
> >
> > The "bad inode number" check is only for inodes inside the "reserved inode"
> > area, namely inum < 12. The only commonly used (=valid) inode numbers in
> > this range are the root inode (=2) and the journal inode (=8), so I suspect
> > you are getting single-bit memory errors in bit 1, or if the controller
> > is the same that would also be viewed with suspicion. It is very likely
> > that you are getting other single-bit errors elsewhere but they are harder
> > to notice.
>
> This is an interesting idea. Is there any simple way this sort of bit
> flip problem could happen outside of the hardware? I've had this
> happen on 4 different machines from 3 different vendors, 3 SATA, and 1
> IDE. It seems almost impossible that it's a memory or controller
> error.

I have to agree with Andreas' analysis. If you could, please send some
compressed raw e2image dump files (see the man page for e2image, but
basically what we need is: "e2image -r /dev/sda5 - | bzip2 > sda5.e2i.bz2"),
taken after the disk is remounted read-only. Then take another e2image
dump after the system has rebooted in single user mode, but *before*
running e2fsck on the filesystem. (That way we can check whether the
filesystem has changed between reboots, which could indicate hardware
problems, or in-memory corruption of the buffer cache due to some kernel
bug.) The e2fsck transcript would also be useful, of course.

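If you want a first look at the two dumps yourself before sending them,
something along these lines might help. It is only a rough sketch, not
an existing tool: the function names are made up for illustration, the
4096-byte read size is an assumption and does not have to match your
filesystem block size, and both dumps must be decompressed first. It
lists the bytes that differ between two raw images and tells you whether
a given difference is a single flipped bit, which is the pattern behind
inode 2 showing up as 3 and inode 8 showing up as 9 (each pair differs
only in the lowest-order bit).

```python
BLOCK_SIZE = 4096  # assumed read size; any value works for the comparison


def diff_images(path_a, path_b, block_size=BLOCK_SIZE):
    """Return [(byte_offset, byte_a, byte_b), ...] for every differing byte.

    If one image is longer than the other, the extra trailing bytes of the
    longer image are ignored.
    """
    diffs = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if not a and not b:
                break
            # Only walk byte by byte through chunks that actually differ.
            if a != b:
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        diffs.append((offset + i, x, y))
            offset += max(len(a), len(b))
    return diffs


def is_single_bit_flip(x, y):
    """True if x and y differ in exactly one bit."""
    d = x ^ y
    return d != 0 and (d & (d - 1)) == 0
```

For example, calling diff_images("before.e2i", "after.e2i") on two
decompressed dumps (hypothetical file names) and passing each reported
byte pair to is_single_bit_flip() shows whether the changes look like
isolated bit flips or like larger overwritten regions. Differences are
expected anywhere the filesystem was legitimately written between the
two dumps, so treat the output as a hint, not a diagnosis.
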
The only other possible explanation I can imagine, beyond a hardware
problem or some strange kernel bug that no one else is seeing, is a bug
in some program that accesses the disk drive directly: for example, a
bootloader that attempted to update some state and wrote that state to
the wrong place on disk, or some other program doing direct disk
accesses that always corrupts the same block(s) in the same way.

Good luck,

						- Ted

_______________________________________________
Ext3-users@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/ext3-users