data corruption - the nightmare continues

Hi,

For quite some time, one of our servers running Red Hat Linux 7.1
(SMP, SCSI) with raid1 on two identical SCSI disks has been giving me
nightmares (see my mail from Nov 14th, 2001):

After some time, read errors like the following occur, causing the
raid to get out of sync:

Additional sense indicates Unrecovered read error
 I/O error: dev 08:19, sector 12850360
raid1: sdb9: rescheduling block 12850360
md: recovery thread got woken up ...
md4: no spare disk to reconstruct array! -- continuing in degraded mode
md: recovery thread finished ...
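
For anyone reading the log above: the "dev 08:19" pair can be mapped back
to a partition name. The sketch below assumes the kernel prints major and
minor numbers in hex and that SCSI disks use major 8 with 16 minors per
disk; the function name decode_scsi_dev is just made up for illustration.
Under those assumptions 08:19 decodes to sdb9, which agrees with the
raid1 line in the same log.

```python
def decode_scsi_dev(dev: str) -> str:
    """Map a 'MM:mm' hex device pair from the kernel log to an sdXN name.

    Assumption: SCSI disk major 8, 16 minor numbers per disk
    (minor 0 = whole disk, minors 1..15 = partitions).
    """
    major_hex, minor_hex = dev.split(":")
    major, minor = int(major_hex, 16), int(minor_hex, 16)
    if major != 8:
        raise ValueError("this sketch only handles SCSI disk major 8")
    disk, part = divmod(minor, 16)      # disk index, partition number
    name = "sd" + chr(ord("a") + disk)  # 0 -> sda, 1 -> sdb, ...
    return name if part == 0 else f"{name}{part}"


print(decode_scsi_dev("08:19"))  # sdb9
```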

Running raidhotremove and then raidhotadd on the faulty partition gets
the raid back in sync, but these errors keep occurring faster
and faster, until at some point it's nearly impossible
to sync the raid again.

What we have already done to try to overcome this error:

- performed RAM checks
- replaced the "faulty" disks with new ones
- replaced the scsi controller (new one has
  a newer bios release)
- replaced scsi cabling
- checked disks with vendor programs
- checked CPU temp. & fan speed
- reduced transfer rate on scsi bus

Needless to say, the checks didn't find any errors. The strange
thing is that after such a replacement the system
works fine for a while, but after some weeks the problem
starts all over again:

[rfu@host tmp]$ cp /tmp/IBMJava2-SDK-13.tgz .
[rfu@host tmp]$ diff /tmp/IBMJava2-SDK-13.tgz ./IBMJava2-SDK-13.tgz 
[rfu@host tmp]$ diff /tmp/IBMJava2-SDK-13.tgz ./IBMJava2-SDK-13.tgz 
Binary files /tmp/IBMJava2-SDK-13.tgz and ./IBMJava2-SDK-13.tgz differ
[rfu@host tmp]$ diff /tmp/IBMJava2-SDK-13.tgz ./IBMJava2-SDK-13.tgz 
Binary files /tmp/IBMJava2-SDK-13.tgz and ./IBMJava2-SDK-13.tgz differ
(but no disk R/W error reported by kernel)
[rfu@host tmp]$ 
(the current directory is /home/rfu/tmp, which is located on a
different partition than /tmp)

Might it be a bug in the disk caching subsystem of the kernel?
It's strange that the first diff reports no difference, but the subsequent ones do.
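
A way to repeat the diff test above without a second copy is to hash the
same file several times and compare the digests. This is only a rough
sketch: the helper names (file_digest, repeated_digests, is_stable) are
made up here, and repeated reads may be served from the kernel's cache
rather than from the disk, so a stable result does not prove the disks
are fine. Differing digests for an unchanged file would, however, show
the same symptom as the diff output.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def repeated_digests(path, rounds=5):
    """Hash the same file several times in a row."""
    return [file_digest(path) for _ in range(rounds)]


def is_stable(digests):
    """True if every read of the file produced the same digest."""
    return len(set(digests)) <= 1


# Example use (path taken from the transcript above):
#   digests = repeated_digests("/tmp/IBMJava2-SDK-13.tgz")
#   print("consistent" if is_stable(digests) else "reads differ", digests)
```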

Meanwhile I think that I'm on the wrong list here, since I guess
that this is not the fault of the raid subsystem; I'd be glad if
anyone could point me to a suitable place for my problem (or
help me get rid of it).

tnx in advance.

rainer.
