Re: Ext3 strangeness data loss

"Theodore Ts'o" <tytso@mit.edu> · Tue, 4 Feb 2003 12:55:25 -0500

On Tue, Feb 04, 2003 at 04:43:17PM -0000, Bodrogi Viktor wrote:
> > > Do you know if there is a mode switch for a RAID-1 setup (my case is
> > > evms-raid) to do this comparison?
> > > This would make sense as an option for debugging and also for
> > > high-availability production.
> > 
> > No there isn't, on any RAID systems that I'm aware of.
> 
> This really breaks my confidence in RAID-1 mirrors.
> Would the situation get better with a four disk RAID-5?
> As I imagine, it should...

Nope.  RAID-5 has a "parity stripe", yes, but it's not used to detect
errors on reads.  It's used to rebuild the RAID array after a disk
failure.  Verifying every read, for instance by requesting both copies
of a block from the two drives of a mirror, would require extra memory
(you need a place to store the extra disk block), consume memory
bandwidth and CPU time to do the block compare, and increase overall
latency (since you have to wait for both disk blocks to be received
and compared before the user application can touch the page).  I don't
know of any RAID system that has been willing to design in the extra
complexity, even as a "debugging" option.
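
To make those costs concrete, here is a minimal sketch of what a
compare-on-read for a two-way mirror might look like.  It is an
illustration only, not how evms, md, or any real RAID driver works.
The 512-byte block size, the in-memory "disks", and the names
make_mirror, read_block and compare_on_read are all assumptions made
up for this example.

```python
import io

BLOCK_SIZE = 512  # assumed block size for this sketch


def make_mirror(blocks):
    """Hypothetical helper: build an in-memory 'disk' from a list of blocks."""
    return io.BytesIO(b"".join(blocks))


def read_block(disk, block_no):
    """Read one block from a seekable binary file object."""
    disk.seek(block_no * BLOCK_SIZE)
    return disk.read(BLOCK_SIZE)


def compare_on_read(disk_a, disk_b, block_no):
    """Return the block only if both mirror halves agree.

    Costs visible here: two buffers instead of one, a full byte-for-byte
    compare, and no data returned until both reads have finished.
    """
    copy_a = read_block(disk_a, block_no)
    copy_b = read_block(disk_b, block_no)  # the extra buffer
    if copy_a != copy_b:
        raise IOError(f"mirror mismatch at block {block_no}")
    return copy_a
```

Note that with only two copies, a mismatch says that the halves
differ, not which one is correct.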

Keep in mind that the RAID design comes from high-end systems where
performance is emphasized, and the only thing that required protection
was the outright failure of the disk drive itself.  Things like CRC or
other checksums were presumed to protect against data errors.

In your particular case, where you told us that you were seeing data
from other files appearing in the wrong place, my guess is that it's
the actual block address which is getting corrupted, not the data
being transferred.  If I recall correctly, IDE UDMA protects the data
blocks being transferred using a CRC, but I don't believe the IDE
command block itself is protected, and that's probably how you're
getting screwed; if that gets corrupted, then the disk drive will send
the wrong disk block back in response to a read request.
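
The following toy model illustrates why a checksum over the data alone
would not notice this kind of failure: a block returned from the wrong
address is still internally consistent.  A checksum that also covers
the intended block number would catch it.  This is a sketch under
stated assumptions, not a description of what IDE, the drive firmware,
or ext3 actually does.  ToyDisk, seal, check_read and the misdirect_to
parameter are hypothetical names invented for the example.

```python
import zlib


def seal(block_no, data):
    """CRC-32 over the 8-byte block number followed by the data."""
    prefix_crc = zlib.crc32(block_no.to_bytes(8, "little"))
    return zlib.crc32(data, prefix_crc)


class ToyDisk:
    """Hypothetical disk that stores two checksums next to each block."""

    def __init__(self):
        self.blocks = {}

    def write(self, block_no, data):
        self.blocks[block_no] = (data, zlib.crc32(data), seal(block_no, data))

    def read(self, block_no, misdirect_to=None):
        # misdirect_to simulates a corrupted address in the command block:
        # the drive faithfully returns a valid block, just the wrong one.
        actual = block_no if misdirect_to is None else misdirect_to
        return self.blocks[actual]


def check_read(disk, block_no, misdirect_to=None):
    """Return (data_only_ok, address_aware_ok) for one read request."""
    data, data_crc, sealed_crc = disk.read(block_no, misdirect_to)
    data_only_ok = zlib.crc32(data) == data_crc
    address_aware_ok = seal(block_no, data) == sealed_crc
    return data_only_ok, address_aware_ok
```

With block 7 and block 9 written with different contents,
check_read(disk, 7) gives (True, True), while
check_read(disk, 7, misdirect_to=9) gives (True, False): the data-only
check passes even though the caller received the wrong block.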

> If this phenomenon is a HW error, should it be logged anywhere?
> I didn't find anything in syslog...

Well, if it is a corrupted block/sector number, it won't get logged
because the HW isn't noticing that something has gone wrong.  With a
cable fault, though, it would be odd for just the request address to
get corrupted and nothing else.  Some kind of weird fault in the
controller or the disk drives themselves might explain these results,
though.

							- Ted

_______________________________________________

Ext3-users@redhat.com
https://listman.redhat.com/mailman/listinfo/ext3-users