Hi,
I'm experiencing silent data corruption on my RAID 5 set of four 400GB
SATA disks.
I first hit the problem a couple of weeks ago and assumed it was related
to using reiserfs, since I hadn't used that filesystem before and have
another perfectly functional RAID 5 array running ext3. After lots of
testing, though, I find the problem happens with ext3 on this array as
well, and after even more testing I find that it only occurs on the
array, not on the individual hard disks.
My test consists of making a ~10GB file of zeros and then checking it
for non-zero bytes. I've also tried creating the file of zeros on a
functional array and copying it across, with the same results.
dd bs=1024 count=10000k if=/dev/zero of=./10GB.tst
od -t x1 ./10GB.tst
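(As a quicker check than eyeballing the od output, something like the
following should also work: cmp stops at the first differing byte and
reports its offset, so "EOF on ./10GB.tst" with no difference reported
beforehand means the file really is all zeros.)

cmp ./10GB.tst /dev/zero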
These commands give me a single row of zeros on my other RAID 5 set in
the same box, and likewise on each individual hard disk in this array
when I put ext3 on each one to check whether a single drive was faulty;
but when the disks are assembled into the RAID 5 array, od spouts lots
of non-zeros at me.
<snip>
21524747740 00 00 00 00 00 00 00 00 00 00 00 00 00 50 5c 36
21524747760 00 10 00 00 00 00 a7 23 00 10 00 80 00 00 00 00
21524750000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
21525147740 00 00 00 00 00 00 00 00 00 00 00 00 00 50 5c 36
21525147760 00 10 00 00 00 00 a7 23 00 10 00 80 00 00 00 00
21525150000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
<snip>
There is a fair bit of that, and the last time I ran the test I got an
I/O error and the mount went into read-only mode.
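Next time that happens I'll try to capture more detail around the
failure, along these lines (I believe -d ata is what smartctl needs for
SATA disks behind libata on this kernel, so treat that flag as an
assumption):

dmesg | tail -n 50             # kernel messages around the I/O error
smartctl -d ata -a /dev/sda    # SMART data for one of the 400GB disks (repeat for sdb..sdd)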
I figure the problem lies either with the Silicon Image 3114 hardware
or driver behind the array, or with the RAID subsystem itself; but as I
mentioned earlier, my other RAID 5 set of three 120GB drives on two IDE
controllers works fine.
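One more test I could run, to take the filesystem out of the picture
entirely, is to write zeros straight to the md device and compare them
on the way back. Roughly (destructive, so only while the array holds no
real data and is unmounted):

# WARNING: this destroys any filesystem on md2 -- scratch testing only
dd bs=1M count=10240 if=/dev/zero of=/dev/md2
dd bs=1M count=10240 if=/dev/md2 | cmp - /dev/zero
# any difference reported before EOF means corruption below the
# filesystem layer, i.e. in md, libata/sata_sil, or the card itself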
I'm running Debian sarge with a 2.6.15.1 kernel on an Athlon XP 2200+
with 1GB of RAM and an Asus A7N8X-Deluxe motherboard, plus two Maxtor
PCI IDE controllers, a Silicon Image 3114 PCI SATA adapter, and the
on-board Silicon Image 3112 controller. The drives: 2x 10GB IDE disks
and a DVD-ROM drive on the on-board IDE controller, 3x 120GB Seagate
disks on the PCI IDE adapters, 2x 80GB Seagate disks on the on-board
Silicon Image 3112, and finally the 4x 400GB disks on the Silicon
Image 3114 PCI adapter.
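(For what it's worth, I believe both Silicon Image chips are handled by
the sata_sil libata driver here; confirming what's actually bound is
something like:

lspci | grep -i 'silicon image'    # both the 3112 and 3114 should show up
lsmod | grep sata_sil              # the libata driver for these chips
)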
biggs:/mnt/test/s0# uname -a
Linux biggs 2.6.15.1.060121 #1 Sat Jan 21 17:01:30 GMT 2006 i686 GNU/Linux
biggs:/mnt/test/s0# cat /proc/mdstat
Personalities : [raid1] [raid5]
md2 : active raid5 sdd1[4] sdc1[2] sdb1[1] sda1[0]
      1172126208 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
      [=>...................] recovery = 8.3% (32500608/390708736) finish=253.3min speed=23564K/sec

md1 : active raid5 hdg1[0] hde1[2] hdi1[1]
      234436352 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md0 : active raid1 hdb2[0] hda2[1]
      9502336 blocks [2/2] [UU]
unused devices: <none>
(Note: md2 is the array with the problems; it happens to be resyncing
above, but I've also run the tests when it's been fully synced, with
the same results.)
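One more data point I can gather is whether the corruption happens on
write or on read: hash the file twice, with an unmount/remount in
between to force the second read off the disks rather than the page
cache (assuming /mnt/test is in fstab so the bare mount works):

md5sum /mnt/test/s0/10GB.tst
umount /mnt/test && mount /mnt/test    # flush cached pages
md5sum /mnt/test/s0/10GB.tst
# a stable-but-wrong hash points at the write path;
# a hash that changes between runs points at the read path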
So, does anyone have any suggestions or tests I could perform to narrow
down where my problem is?
Regards,
Michael Barnwell.