Re: EXT3-fs unexpected failure msg ?

Sev Binello <sev@xxxxxxx> · Tue, 18 Apr 2006 09:57:46 -0400

Andreas Dilger wrote:
On Apr 17, 2006  21:30 -0400, Sev Binello wrote:

Damian Menscher wrote:

On Mon, 17 Apr 2006, Andreas Dilger wrote:

You really, really, really need to mount your filesystem with
"-o errors=remount-ro", at least to prevent filesystem corruption.
I'm not sure if this is enough to prevent corruption in the case
of your RAID disconnects (if it doesn't generate errors up to the
filesystem, but still discards writes), but it is at least a minimum
requirement.

Since this was so strongly worded, I just did a random spot-check of
some of our filesystems (RHEL4) and discovered they all have:

 Errors behavior:          Continue

in the superblock (and mount apparently takes that option).  This makes 
me curious: if it's so obvious that it should remount-ro on errors, why 
is the default (on RHEL4, at least) to continue?

It was only so strongly worded because Sev has had repeated failures of
the RAID hardware resulting in filesystem corruption, and it seems prudent
to stop the filesystem at the first inkling of corruption in this case.
Not all environments see so many problems, and the choice to use remount-ro
is up to the admin (though I believe Debian uses this as the default).
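
For anyone who wants to repeat Damian's spot-check across several filesystems, here is a minimal sketch in Python. It assumes you have already captured the text printed by "tune2fs -l <device>" (or "dumpe2fs -h"); the helper names errors_behavior and stops_on_error are made up for illustration, and the sample string just reuses the line quoted above.

```python
def errors_behavior(tune2fs_output):
    """Return the 'Errors behavior' value from `tune2fs -l` style text, or None."""
    for line in tune2fs_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "errors behavior":
            return value.strip()
    return None


def stops_on_error(behavior):
    """True if the superblock default is anything other than 'continue'."""
    # Hypothetical helper: treats any value mentioning "continue" as
    # "keeps running read-write after a metadata error".
    return behavior is not None and "continue" not in behavior.lower()


# The line quoted in the message above, as it appears in the superblock dump.
sample = " Errors behavior:          Continue\n"
print(errors_behavior(sample), stops_on_error(errors_behavior(sample)))
```

The superblock default can be changed with "tune2fs -e remount-ro <device>"; the "-o errors=remount-ro" mount option discussed above sets the same behavior for a single mount.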

My question/concern is this: we sometimes have trivial errors that we
have to live with until we can take our operational systems down long
enough to fsck. Will this option automatically put us in ro mode no
matter how trivial the problem is?

This will only trigger on cases where there is a consistency error detected
in the ext3 metadata.  It doesn't affect regular IO errors for file data.

Ok, I'm assuming this would be any error reported in /var/log/messages
that is preceded by EXT3-fs.
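
A rough way to pull those lines out of a log, sketched in Python. This only filters on the "EXT3-fs" tag mentioned above; whether every such line corresponds to an error that would trigger remount-ro is Sev's assumption and is not confirmed in this thread. The function name ext3_messages and the sample log lines are invented for illustration and are not real kernel output.

```python
def ext3_messages(log_lines, tag="EXT3-fs"):
    """Yield the log lines that contain the ext3 kernel message tag."""
    for line in log_lines:
        if tag in line:
            yield line.rstrip("\n")


# Made-up lines for illustration only; real messages will differ.
sample = [
    "Apr 17 12:00:01 host kernel: EXT3-fs error (device sdX1): example message\n",
    "Apr 17 12:00:02 host sshd[123]: unrelated line\n",
]
for msg in ext3_messages(sample):
    print(msg)
```

Against a real log, the same function can be fed an open file, e.g. "with open('/var/log/messages') as f: ..." and then iterating over ext3_messages(f).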

However, that said, it surprises me that you are getting any kind of errors,
even "trivial" ones, often.  I wouldn't consider a RAID system where you
often get errors to be very reliable.

No argument from us.

Also, when we had the problem earlier today (i.e. the RAID controller
didn't fail over for about 20 minutes), we did stop and fsck.
But even so, when we checked after it was done, it still said the state
was "clean with errors"?

When you run e2fsck, are you specifying the "-f" flag?  For ext3 filesystems,
an e2fsck (without -f) will normally not do a full filesystem check unless
the superblock has been flagged with an error.  This allows e2fsck to run
against the filesystem always at boot, but normally only do journal replay
(seconds at most) unless there was an error reported.
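
As a small sketch of the above, in Python: read the "Filesystem state" field from "tune2fs -l" style text and build an e2fsck command line that includes -f. The helper names filesystem_state and e2fsck_command and the device /dev/sdX1 are placeholders, not anything from Sev's setup, and the sample string only reuses the "clean with errors" state reported earlier in this thread. The sketch builds the argument list but does not run it; e2fsck should only be run on an unmounted filesystem.

```python
def filesystem_state(tune2fs_output):
    """Return the 'Filesystem state' value from `tune2fs -l` style text, or None."""
    for line in tune2fs_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "filesystem state":
            return value.strip()
    return None


def e2fsck_command(device, force=True):
    """Build the argv for e2fsck; -f forces a full check even if marked clean."""
    argv = ["e2fsck"]
    if force:
        argv.append("-f")
    argv.append(device)
    return argv


sample = "Filesystem state:         clean with errors\n"
print(filesystem_state(sample))
print(" ".join(e2fsck_command("/dev/sdX1")))  # placeholder device name
```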

We tried running fsck again with no better results,
though when it started it said...
     "ext3 recovery flag clear but journal has data"
Any advice here?

Run "e2fsck -f"?  I haven't seen this unless the superblock was corrupted
and had to be restored from backup or similar.

Will try it
Thanks

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

--

Sev Binello
Brookhaven National Laboratory
Upton, New York
631-344-5647
sev@xxxxxxx

_______________________________________________

Ext3-users@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/ext3-users