Cannot sync RAID5 anymore - MCE problem?

Hello everybody.

I have a strange and severe problem with my RAID array, and I have to
consult some "experts" before I can continue with a clear conscience.
I wanted to replace two disks in my RAID5 array with two newer ones. So I
connected the new disks to my machine, partitioned them with the correct
layout, added one of them as a spare, and "--fail"ed the partition of the
disk I wanted to remove first. Rebuilding of the array started immediately
and finished fine.
Then I took the two old disks out and put the new ones in; removing the
second old disk from the array degraded it. After booting I added the
correct partition of my new drive to the RAID5 and waited for the syncing
to finish ...
but it didn't. It crashed my whole machine with an MCE:
CPU 0: Machine check exception: 4 bank 4 b20000000000070f0f
...
Kernel panic - not syncing: machine check
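
For reference, the replacement sequence I used was roughly the following
(the device names /dev/sda, /dev/sdc and /dev/md0 are just examples, not my
exact ones):

  # copy the partition layout of the old disk to the new one (one way to do it)
  sfdisk -d /dev/sda | sfdisk /dev/sdc

  # add the new partition as a spare, then fail and remove the old one
  mdadm /dev/md0 --add /dev/sdc1
  mdadm /dev/md0 --fail /dev/sda1
  mdadm /dev/md0 --remove /dev/sda1

  # watch the rebuild progress
  cat /proc/mdstat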

The reason I am writing about an MCE problem to the software RAID list is
that the problem is very reproducible and always happens when the resync of
my array reaches 24.9%. I have tried it about ten times, so I am really
sure there is some connection to resyncing; the problem no longer seems to
appear under other conditions.
I tried to make an rsync backup of my RAID array, which led to the same
crash. I then noticed that this crash occurred while copying a not very
important backup of something else. I deleted that old backup, and since
then the problem seems to ONLY occur when I try to resync the array.

I am running Gentoo on an AMD64 3200+ with a K8N Neo4 Platinum board, and
my problem seems similar to what these guys reported:
http://kerneltrap.org/node/4993
except that mine is somehow tied to resyncing. I have reiserfs on my array
and have successfully completed a "reiserfsck --rebuild-tree".
It is probably not important, but I should mention that I use LVM, too.

I have also tried to resync the array to my old disk (with the second new
one removed), but that leads to the same problem.

I have tried several things: removing one RAM module, using different RAM
banks, checking for leaking capacitors, running without DMA, trying
different kernels, and playing with some kernel options.

Is there a way to figure out which piece of hardware is causing the
problem?
My hardware worked flawlessly for over 1.5 years, so unless I broke
something while physically mounting the disks or cleaning dust out of the
case, it can only be a problem with the first new hard drive (which is
unfortunately already part of my degraded RAID array). Is it possible that
a SATA1 cable on a SATA2-capable controller connected to a SATA2-capable
disk leads to such errors?
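
So far the only generic checks I can think of running myself are something
like this (assuming smartmontools is installed; /dev/sdX is a placeholder
for each drive):

  # SMART status and a long offline self-test per drive
  smartctl -a /dev/sdX
  smartctl -t long /dev/sdX

  # non-destructive read test of the whole disk
  badblocks -sv /dev/sdX

plus a few passes of memtest86+ from a boot CD.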

Since I was able to copy my data, I think it is in perfect condition, but
there seems to be a problem in the "empty" part of the array. Does anybody
know a way to overwrite or rewrite the empty blocks of a reiserfs
partition, or a tool to find and correct disk problems? (I tried
reiserfsck, but it does not find anything.)
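
The only way I can think of to rewrite the free space is to fill it with a
dummy file and delete it again, roughly like this (the mount point is just
an example):

  # fill the free space of the mounted array with zeros, then remove the file
  dd if=/dev/zero of=/mnt/array/zerofill bs=1M
  rm /mnt/array/zerofill
  sync

but I don't know whether that actually touches the blocks that cause the
problem.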

What is the smartest way for me to proceed to make my degraded array
redundant again?
I could delete the whole array, set it up identically again, and recopy
the data, but if it really is a hardware problem that would be a waste of
time.
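
If recreating really is the way to go, I assume it would boil down to
something like this (example devices again; my real setup has LVM on top of
the array):

  # stop the degraded array and create a fresh one over the same partitions
  mdadm --stop /dev/md0
  mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1
  # then recreate the LVM volumes and the reiserfs filesystem, and copy the data back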

Thanks in advance ...
	Ramin
