files disappear in ext3 filesystem

tytso@mit.edu (Theodore Tso) · Fri, 18 Jan 2002 01:50:11 -0500

On Thu, Jan 17, 2002 at 10:59:42PM -0700, Andreas Dilger wrote:
> 
> This would indicate that the filesystem is mounted as ext2 and not ext3.
> What does "tune2fs -l /dev/<device>" say about this device?  What does
> /proc/partitions show?

A very common problem with RedHat installs seems to be that people
don't get the initrd right (and ext3 is loaded as a module), so the
root filesystem ends up getting mounted as ext2, even though a journal
has been put on the device.  So "tune2fs -l /dev/XXXX" isn't the right
test.  The right test is to check /proc/mounts and make sure it
really is getting mounted as ext3 and not as ext2.  I'm guessing the
problem is that it was mounted as ext2.
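
As a rough sketch of that check (the function name root_fs_types is
made up for illustration, and it assumes the usual "device mountpoint
fstype options ..." layout of /proc/mounts), something like this
reports what the root filesystem was actually mounted as:

```python
def root_fs_types(mounts_text):
    """Return the filesystem type of every entry mounted on "/", in file order."""
    types = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Assumed field order: device, mount point, fs type, options, ...
        if len(fields) >= 3 and fields[1] == "/":
            types.append(fields[2])
    return types


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as f:
            found = root_fs_types(f.read())
    except OSError:
        # Not a Linux system, or /proc is not mounted.
        found = []
    print("root is mounted as:", ", ".join(found) or "unknown")
```

If this prints ext2 for a device that has a journal on it, the initrd
is the first thing to look at.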

> > Jan 14 09:22:33 peanut fsck: /: 143780/7241728 files (1.1% non-contiguous), 986610/14458492 blocks 

This propensity for most RedHat users to screw up the initrd and cause
the root to be mounted ext2 is made even worse because most newbie
users make the mistake of creating a single large root filesystem.
(14 gigs, in this case).  

Bad move....  personally, I use a 128 meg root partition, and then
create /usr and /var partitions, and so even though my root partition
is still ext2, I rarely have a problem with unclean shutdowns, since
the root partition is rarely modified even though it's mounted
read-write.  (/tmp is a symlink to /var/rtmp, and in some cases /var
is a symlink to /usr/var, if I only want to have a / and /usr
partition)

> While the number of error messages for the "mikelee" log is a bit high, both
> the "i_blocks is X, should be Y" and "inode N has zero dtime" errors are not
> unusual for an ext2 filesystem that was in use when it crashed.  What they
> mean is (likely) that these inodes were being written to at the time the
> systems lost power, and there is nothing ext2 can do about this.

I agree, that's likely the cause.  Chalk this up to crappy hardware;
what happens is that the +5 volt line from the power supply drops
faster than the +12 volt line, and in any case, memory tends to
suffer the most in low voltage situations.  So in a power-fail
situation, the memory goes insane first, but there's enough voltage
for the DMA engine, disk controller, and disk drive to continue, such
that garbage is written to the disk.

In higher quality hardware, there is a power-fail interrupt which can
be used for the hardware to abort all DMAs that are in progress before
things go to hell, but most PC-class hardware doesn't have this
feature.  (Now that I'm at IBM, maybe I can agitate to change this;
although IBM has 400,000 employees, and I'm in a different department
from the ones who make hardware.  :-)

The other workaround is to get a UPS, and then monitor the "low
battery" indication from the UPS via the RS-232 interface, so that
you can do a graceful shutdown when the UPS reports that it's running
low on batteries.  (This is basically an expensive replacement for a
power-fail interrupt.)

Using ext3 will usually help, since it's very likely that the blocks
that end up getting written as trash are in the journal, and so
the garbaged blocks will get overwritten with good copies when the
journal is replayed onto the disk.
So that will generally keep you out of trouble, even without adding
the UPS.

Note, though, that ext3 helping you with crappy hardware doesn't carry
over to other filesystems that use journals, such as reiserfs and xfs.
The reason for that is that they do logical journaling, and not
physical block journaling.  So when part of the inode table is
modified, what is represented in the journal is a logical
representation of the change.  For example, a series of bytes which
indicate, "set the mod-time of the inode to XXXX".  However, this kind
of logical logging, while space efficient since you don't write out
the entire block, does mean that if the original block is written out
as garbage, there isn't enough redundant data in the journal to
reconstruct the garbaged block.
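
To make the difference concrete, here is a toy model.  It is only a
sketch: the record formats and the names replay_physical and
replay_logical are invented for illustration and have nothing to do
with the real on-disk journal layout of ext3, reiserfs, or xfs.  The
"disk" is just a list of 16-byte blocks.

```python
def replay_physical(disk, journal):
    """Physical journal: each record is (block_no, full_block_image).

    Replay overwrites the whole block, so the old contents don't matter.
    """
    disk = list(disk)
    for block_no, image in journal:
        disk[block_no] = image
    return disk


def replay_logical(disk, journal):
    """Logical journal: each record is (block_no, offset, new_bytes).

    Replay patches only the logged bytes; the rest of the block is
    taken on trust from whatever is already on the disk.
    """
    disk = list(disk)
    for block_no, offset, data in journal:
        block = bytearray(disk[block_no])
        block[offset:offset + len(data)] = data
        disk[block_no] = bytes(block)
    return disk


if __name__ == "__main__":
    wanted = b"inode-A:mtime=02"   # what the block should contain
    trash = b"\xff" * 16           # what the dying machine wrote instead

    # Full block image in the journal: the garbage is simply replaced.
    print(replay_physical([trash], [(0, wanted)]) == [wanted])

    # Only the changed bytes in the journal: the garbage mostly survives.
    print(replay_logical([trash], [(0, 14, b"02")]) == [wanted])
```

The first comparison comes out true and the second false: the logical
record patches two bytes and leaves the other fourteen as garbage.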

						- Ted