RE: MD drivers problems on 2.4.x

"Riley Williams" <Riley@Williams.Name> · Mon, 26 May 2003 09:17:28 +0100

Hi Michael.

Unfortunately, I'm no RAID expert - I've never been in a position to
use RAID myself. As a result, I'm unable to help with your problem.

I've CC'd this to the Linux-RAID reflector where people who can help
you are likely to be found...

Best wishes from Riley.
---
 * Nothing as pretty as a smile, nothing as ugly as a frown.

 > -----Original Message-----
 > From: Michael Daskalov [mailto:MDaskalov@technologica.biz]
 > Sent: Monday, May 26, 2003 8:39 AM
 > To: Riley Williams
 > Subject: RE: MD drivers problems on 2.4.x
 >
 > Hi,
 >
 > I had a similar problem with kernel 2.4.18 from SuSE.
 > I have 4x120GB IDE disks from IBM/Hitachi.
 >
 > I've set up RAID 5 on 4 disks (/dev/hda7, /dev/hdb7, /dev/hdc7,
 > /dev/hdd7). I've also set up RAID 1 (/dev/hda1, /dev/hdc1) and
 > some other RAID 1 md devices.
 >
 > The system was running fine for a week but suddenly /dev/hdd
 > gave me an error while reading one single block (on /dev/hdd7).
 >
 > I was trying to read it with  dd if=/dev/hdd7  and it failed
 > every time. I tried to read it with dd if=/dev/hdd and I
 > succeeded.
 >
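
One way to narrow down which region is unreadable is to read the device
in fixed-size chunks and record the offsets that raise an I/O error. The
Python sketch below does that. The function name and chunk size are
illustrative, the reads go through the page cache (so previously cached
data may mask an error), and it needs read permission on the device.

```python
import os

def find_unreadable_chunks(path, chunk_size=4096, limit=None):
    """Return byte offsets of chunks that fail to read with an I/O error."""
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while limit is None or offset < limit:
            try:
                data = os.pread(fd, chunk_size, offset)
            except OSError:
                bad.append(offset)      # remember the failing chunk, keep going
                data = None
            if data is not None and len(data) == 0:
                break                   # end of device or file
            offset += chunk_size
        return bad
    finally:
        os.close(fd)
```

Called as find_unreadable_chunks("/dev/hdd7"), it would return an empty
list if every chunk of the partition could be read.
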
 > I rebooted the computer with a Hitachi test diskette, and
 > tested the HDD in Full mode.
 >
 > It said there were no errors at all. I am not a S.M.A.R.T.
 > expert, but I think that if it were really a hardware error, it
 > would be reported in SMART's error log.
 >
 > Then I thought, OK - the drive should be fine, I'll just
 > 'raidhotadd' it to the array and everything will be fine.
 > It almost succeeded. While reconstructing the array, /dev/hda
 > gave a similar error, but on a different LBA sector.
 >
 > So, my whole raid5 was gone. Luckily it was still in test-only
 > mode.
 >
 > I switched to vanilla 2.4.20, which I compiled myself, and it
 > has been running fine so far.
 >
 > I just know I shouldn't count on this too much, but do I have
 > an alternative?!?
 >
 > Here are the errors from /var/log/messages:
 >
 > May  6 00:15:16 devsrv kernel: hdd: dma_intr: status=0x51 {
 >		DriveReady SeekComplete Error }
 > May  6 00:15:16 devsrv kernel: hdd: dma_intr: error=0x40 {
 >		UncorrectableError }, LBAsect=24145681, high=1,
 >		low=7368465, sector=32048
 > May  6 00:15:16 devsrv kernel: end_request: I/O error, dev 16:47
 >		(hdd), sector 32048
 > May  6 00:15:16 devsrv kernel: raid5: Disk failure on hdd7,
 >		disabling device. Operation continuing on 3 devices
 > May  6 00:15:16 devsrv kernel: md: updating md2 RAID superblock
 >		on device
 > May  6 00:15:16 devsrv kernel: md: (skipping faulty hdd7 )
 >
 > .....
 >
 > May  7 14:42:58 devsrv kernel: hda: dma_intr: error=0x40 {
 >		UncorrectableError }, LBAsect=38454698, high=2,
 >		low=4900266, sector=14341032
 > May  7 14:42:58 devsrv kernel: end_request: I/O error, dev 03:07
 >		(hda), sector 14341032
 > May  7 14:42:58 devsrv kernel: raid5: Disk failure on hda7,
 >		disabling device. Operation continuing on 2 devices
 > May  7 14:42:58 devsrv kernel: md: updating md2 RAID superblock
 >		on device
 > May  7 14:42:58 devsrv kernel: md: hdb7 [events:
 >		00000046]<6>(write) hdb7's sb offset: 30724160
 > May  7 14:42:58 devsrv kernel: md: recovery thread got woken up ...
 > May  7 14:42:58 devsrv kernel: md2: no spare disk to reconstruct
 >		array! -- continuing in degraded mode
 >
 > .... And then
 >
 > May  7 14:42:58 devsrv kernel: md: recovery thread finished ...
 > May  7 14:42:58 devsrv kernel: md: hdc7 [events:
 >		00000046]<6>(write) hdc7's sb offset: 30724160
 > May  7 14:42:58 devsrv kernel: md: (skipping faulty hda7 )
 > May  7 14:43:02 devsrv kernel: vs-13070: reiserfs_read_inode2:
 >		i/o failure occurred trying to find stat data of [144691
 >		148158 0x0 SD]
 > May  7 14:43:02 devsrv kernel: vs-13070: reiserfs_read_inode2:
 >		i/o failure occurred trying to find stat data of [144691
 >		148158 0x0 SD]
 > May  7 14:43:02 devsrv kernel: vs-13070: reiserfs_read_inode2:
 >		i/o failure occurred trying to find stat data of [144691
 >		208720 0x0 SD]
 > May  7 14:43:02 devsrv kernel: vs-13070: reiserfs_read_inode2:
 >		i/o failure occurred trying to find stat data of [144691
 >		208720 0x0 SD]
 >
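
The numbers in the log excerpts above can be cross-checked with a short
Python sketch. It assumes that the "dev XX:YY" field is the hexadecimal
major:minor pair and that LBAsect equals high * 2**24 + low; both
assumptions agree with the values quoted here. The IDE table is the
conventional Linux one (major 3 for hda/hdb, major 22 for hdc/hdd, 64
minors per drive). The function names are illustrative.

```python
# Conventional Linux IDE device numbering (assumed): 64 minors per drive.
IDE_MAJORS = {3: ("hda", "hdb"), 22: ("hdc", "hdd")}

def decode_dev(dev):
    """Turn a 'major:minor' hex pair such as '16:47' into a name like 'hdd7'."""
    major, minor = (int(part, 16) for part in dev.split(":"))
    drive = IDE_MAJORS[major][minor // 64]
    partition = minor % 64
    return drive + (str(partition) if partition else "")

def lba_from_parts(high, low):
    """Combine the 'high' and 'low' log fields into one sector number."""
    return (high << 24) | low

print(decode_dev("16:47"), lba_from_parts(1, 7368465))   # hdd7 24145681
print(decode_dev("03:07"), lba_from_parts(2, 4900266))   # hda7 38454698
```

Both results match the log: the failing requests were on hdd7 and hda7,
the two RAID 5 members that were dropped from md2.
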
 > There is a very ugly bug here, IMHO. See:
 >
 > May  7 14:42:58 devsrv kernel: raid5: Disk failure on hda7,
 >		disabling device. Operation continuing on 2 devices
 > 
 > How can a raid5 array built on 4 devices operate with 2 devices?
 > It should rather stop and do nothing ....
 >
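
On the question above: RAID 5 keeps one parity block per stripe, so a
4-device array can rebuild the contents of any one missing device but not
two. A minimal Python sketch of the XOR parity idea follows. The block
contents and helper names are made up for illustration, and this is not
the md driver's code.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a hypothetical 4-device RAID 5: three data blocks plus parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
stripe = data + [parity]

def rebuild(stripe, missing):
    """Rebuild the block at index `missing` from the surviving blocks."""
    survivors = [blk for i, blk in enumerate(stripe) if i != missing]
    return xor_blocks(survivors)

# With one device gone, its block is the XOR of the other three.
recovered = rebuild(stripe, 1)
print(recovered == data[1])
```

With two devices missing, each stripe has two unknown blocks and only one
parity equation, so the lost data cannot be computed from the survivors.
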
 > Then, after a reboot, I did
 > mkraid --dangerous-no-resync --force /dev/md2
 > I was able to mount the filesystem (reiserfs 3.6 format) and to read
 > some data from some files, but it was very heavily corrupted.
 > The Oracle installation that was there was useless (sqlplus gave
 > segmentation faults), and so on.

