Re: Ext3 strangeness data loss

Hi!

This morning I booted and, what a horror, found a bad superblock on /var!
Running fsck reported nothing, but mount still said bad superblock.
It's the best thing that can happen after a project's due date, but before
finishing it, isn't it?
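
(By the way, would pointing e2fsck at one of the backup superblocks be the
right thing to try in such a case?  Something along the lines of

    e2fsck -b 32768 /dev/hda7

where /dev/hda7 and 32768 are only placeholders for the real device and the
actual backup superblock location.)
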
So I decided to switch to reiserfs, which has performance advantages too.
After about the fifth reboot I could mount /var, and copied it to a new
partition, together with the root partition.
And, to my horror, I had the same problem with /usr/sbin/sshd at startup,
even though the binary had not changed, according to a diff against a
probably-good backup (who can be sure of anything after all this...).

So the conclusion is that this possibly has nothing to do with ext3.
It's not openssh, because I had problems with other files/dirs, too...
Maybe it's evms?
Maybe it's the kernel?
It's a stock 2.4.19, only with the evms and vserver patches.
I don't think it's a distro problem...

So sorry for talking about this on the ext3 list!

Thanks for all the help!

viktor

more comments below...

> > 
> > Seems interesting.
> > I forgot to mention (yes, sorry, it's an important piece of information)
> > that I have RAID 1 (mirrored disks), so a HW problem is less likely.
> > And I have a reiserfs partition on the mirror too, without any problem.
> 
> RAID protects you against disk failures.  It does not protect you from
> cable problems causing data corruption, or your RAID controller going
> insane.  Unfortunately a lot of people seem to believe that just
> because they have RAID, they are immune from hardware problems, and
> then stop doing backups.  I usually hear from them after they've
> gotten screwed, and when they ask if I can perform miracles....

Yes, RAID is completely different from backup.
RAID doesn't protect you from rm -fr / ;))

> 
> In any case, the scenarios I described (a controller/cable problem, or
> incorrectly configured IDE DMA settings) are all still possible
> with RAID; RAID does not help you prevent these sorts of problems.

It's SW RAID-1; the disks are on the same controller,
but on different buses / cables.
Am I right that in this case HW errors are *very* unlikely?
That would mean exactly the same bit errors occurring at exactly
the same time on different cables/disks...

> As far as your not noticing the problem with reiserfs goes, that could be
> because you've been lucky and haven't noticed because the block addresses
> causing the problem do not (yet) contain data.  But the symptoms
> you've described sound very much like hardware-induced errors.
> 
> > Anyway, do you have an idea how to test for HW errors?
> 
> Well, if you have a scratch partition that's not being used, you can
> try using the badblocks program.  Try using the -w option, which will
> do a read/write test.  This doesn't do a random access test, so it
> might not detect any problems, though.
> 
> I'd suggest checking your internal cabling, and replacing the
> controller cable if it looks dubious.  Make sure everything is well
> plugged in, too.
> 

I use the most expensive, twisted, shielded, etc. cables, well plugged in,
at least visually...
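
I will also give badblocks a try on a spare partition, as you suggest,
probably something along the lines of

    badblocks -w -s -v /dev/hdc6

(the -w read/write test is destructive, so only on a scratch partition;
/dev/hdc6 is just a placeholder for whatever spare partition I can find).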

Thanks for all the answers!

viktor


