bizarre data corruption...

"A. James Lewis" <james@xxxxxxxxxx> · Sat, 11 Dec 2004 16:27:14 -0000 (GMT)

I have a little story to tell, which I hope will ring a bell with some of
you folk. Perhaps someone knows how this could happen, or has had it happen
to them!

A little while ago, I built a box to host a large RAID array, 8x200GB
ATA100 disks... I first set it up as RAID5 and 2 disks failed on the same
day (I'm cursed!). So on the second attempt I used RAID6 (with the patch
which was posted here a few weeks ago to avoid corruption when partially
degraded).

All was well, or so it seemed, and I had a lovely new array of 1.14TB to
store files on...

But when populating it with files from various sources, I noticed that
some files which should have shown up as duplicates did not, and on
further investigation I discovered that about once per 1GB written to the
array, 1 bit was inverted when the file was read back...
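
For anyone trying to confirm this kind of single-bit damage, a minimal
Python sketch along these lines could compare a known-good copy of a file
with the copy read back from the array and report which bits differ. The
function names and the chunk size are illustrative assumptions, not
something used in the original investigation.

```python
def find_flipped_bits(path_a, path_b, chunk_size=1 << 20):
    """Return [(byte_offset, xor_mask), ...] for every byte that differs.

    xor_mask has a 1 in each bit position that is inverted between the
    two files, so a single flipped bit shows up as a power of two.
    """
    diffs = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if len(a) != len(b):
                raise ValueError("files differ in length")
            if not a:
                break
            if a != b:
                # Only walk the chunk byte by byte when it actually differs.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        diffs.append((offset + i, x ^ y))
            offset += len(a)
    return diffs


def count_flipped_bits(diffs):
    """Total number of inverted bits across all differing bytes."""
    return sum(bin(mask).count("1") for _, mask in diffs)
```

Dividing the result of count_flipped_bits() by the amount of data compared
gives an error rate that can be set against the "about 1 bit per 1GB"
figure above.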

I suspected the machine, since it was a dual Athlon 2200+ built from parts
I had left over from other projects, and it was desperately in need of a
BIOS upgrade... So I bought a nice new Supermicro/Serverworks Xeon board
and swapped that in...

I noticed that the logs now showed occasional DMA status errors on 1 of
the busses... So I thanked my lucky stars that the Serverworks
chipset/driver had spotted what the AMD chipset did not, and replaced
those 2 disks with spares... After the resync, fsck was a complete mess,
and I decided that the best thing to do would be to try to get an easily
repeatable scenario where the problem could be demonstrated...

The machine has 4 IDE busses and 8 drives: 2 busses onboard and 2 using
HPT371 single-channel IDE cards... Each drive has a 1GB swap partition, so
I turned the swap off, wrote random data from /dev/urandom to the first
swap partition, and then read it back into a file (so as to get the size
right). This file was then written to each of the swap partitions and
those were then checked with cksum... 10 times... no errors.
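
The sequential write-and-check step above can be sketched in Python
roughly as follows. This is an illustration under assumptions, not the
commands actually run: it uses zlib.crc32 as a stand-in for cksum (the two
produce different values, but either detects a flipped bit), the function
names are made up, and it repeats the whole write-and-check cycle, which
is only one reading of "10 times". Note that a read-back shortly after a
write may be served from the page cache rather than the disk, which this
sketch does nothing to prevent.

```python
import os
import shutil
import zlib


def crc32_prefix(path, length, chunk_size=1 << 20):
    """CRC-32 of the first `length` bytes of `path` (file or partition)."""
    crc = 0
    remaining = length
    with open(path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                raise IOError("%s is shorter than %d bytes" % (path, length))
            crc = zlib.crc32(chunk, crc)
            remaining -= len(chunk)
    return crc


def write_and_verify(reference, targets, passes=10):
    """Write `reference` to each target in turn and check it reads back.

    Returns a list of (pass_number, target) for every mismatch; an empty
    list means no corruption was seen.
    """
    length = os.path.getsize(reference)
    expected = crc32_prefix(reference, length)
    failures = []
    for n in range(passes):
        for target in targets:
            with open(reference, "rb") as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
                dst.flush()
                os.fsync(dst.fileno())  # push the data out to the device
            if crc32_prefix(target, length) != expected:
                failures.append((n, target))
    return failures
```

The targets would be the (inactive!) swap partitions; anything on them is
overwritten.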

Next I tried writing the file to all the swap partitions concurrently...
same result, no corruption...

So I tried reading the data back from each drive concurrently... this
yielded a consistently corrupted file (only by a few bytes, but still
corrupted)... from the disks on the secondary onboard IDE bus.
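
A concurrent read-back of this kind could be approximated with one reader
thread per drive, as in the Python sketch below. Again this is only an
assumption about how one might reproduce the test: zlib.crc32 stands in
for cksum, the function names are hypothetical, and the expected checksum
is taken to have been computed from the reference file beforehand.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def crc32_prefix(path, length, chunk_size=1 << 20):
    """CRC-32 of the first `length` bytes of `path` (file or partition)."""
    crc = 0
    remaining = length
    with open(path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                raise IOError("%s is shorter than %d bytes" % (path, length))
            crc = zlib.crc32(chunk, crc)
            remaining -= len(chunk)
    return crc


def concurrent_readback(targets, length, expected_crc):
    """Read all targets at the same time; return those that do not match."""
    # One worker per target so every drive is being read simultaneously.
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        futures = {t: pool.submit(crc32_prefix, t, length) for t in targets}
        results = {t: f.result() for t, f in futures.items()}
    return [t for t in targets if results[t] != expected_crc]
```

Running this repeatedly and noting which targets come back in the returned
list would show whether the failures stay tied to one bus, as they did
here.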

At this point, I have changed the motherboard, the disks, the cable...
well, ALL the hardware.... I eventually solved the immediate problem with
another HPT371 controller for that bus... but it seems that anything on
the secondary onboard controller will be randomly corrupted in a RAID
configuration where all the drives are read at the same time...

This happened with 2 motherboards using 2 different chipsets, but ALL the
hardware was changed, even the brand of the replacement disks was
different... so I am left wondering if this could be a kernel bug?  I
can't say for sure whether the problem was identical with the previous
motherboard, but the chances of getting 1 or 2 bits corrupted per GB of
data using 2 completely different motherboard/disk combinations (even the
PSU was changed!) are pretty remote, leaving me to ask if it could have
been the kernel?

It would have to be common to the Serverworks and AMD 760 MP chipsets,
which seems unbelievably unlikely, but I am running out of ideas...

Otherwise, I love RAID6... I would have lost a whole lot of data over this
if I'd still been using RAID5...

-- 
¯·.¸¸.·´¯·.¸¸.-> A. James Lewis (james@xxxxxxxxxx)
http://www.fsck.co.uk/personal/nopistons.jpg
MAZDA - World domination through rotary power.
