Re: RAID-6 mdadm disks out of sync issue (more questions)

Bill Davidsen <davidsen@xxxxxxx> · Mon, 15 Jun 2009 11:48:33 -0400

linux-raid.vger.kernel.org@xxxxxxxxxxx wrote:
This doesn't make a lot of sense.  It should not have been marked
as a spare unless someone explicitly tried to "Add" it to the
array.

However, your description of events suggests that this was automatic,
which is strange.

Yes, it was entirely automatic.  The only commands I had running on the computer when it happened were:

# watch -n 0.1 'uptime; echo; cat /proc/mdstat|grep md13 -A 2; echo; dmesg|tac'

This gave me a nice, simple display of what was going on with the
rebuild, and a monitor of dmesg in case there were any new kernel
messages.
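
A variant of that monitoring command, sketched here on the assumption
that a record which survives a hang or reboot would be useful. The
helper name md_status and the log path are hypothetical, not taken
from this thread, and grep -A assumes GNU or BusyBox grep.

```shell
#!/bin/sh
# Hypothetical helper: print the status block for one md array.
#   $1 = array name, e.g. md13
#   $2 = mdstat-format file to read (defaults to /proc/mdstat)
md_status() {
    # The array line starts at column 0 as "md13 : ..."; -A 2 also
    # prints the two lines after it (block count and rebuild progress).
    grep -A 2 "^$1 " "${2:-/proc/mdstat}"
}

# Example loop, left commented out: append a timestamped snapshot
# every 10 seconds. The log path is an assumption; it should live on
# a disk that is not part of the array being rebuilt.
# while sleep 10; do
#     { date; uptime; md_status md13; echo; } >> /root/md13-rebuild.log
# done
```

Unlike watch, this keeps a history, so the last few snapshots before a
disk drops out can be read back afterwards.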

Can I get the complete kernel logs from when the rebuild started
to when you finally gave up?  It might help me understand.

Sure.

Just to confirm, /dev/sd{a,b,c,d,e,f}1 are the partitions which
contain my up-to-date data.  /dev/sd{i,j}1 contain data that is many
days old.

Here is the entire dmesg output during the rebuild:

I left it running for about an hour, and none of the disks had any errors.
I really hope it is not a permanent fault 75% of the way through the disk.
Though if it was just bad sectors, why would the disk be disconnecting
from the system?

Thanks again for all your help.

I really don't see any indication that this is a kernel issue. My VM
host machine has multiple VMs, including this "desktop" system, runs
raid5 and raid10, and has had no "ata" messages in 15 days of uptime,
obviously with lots of disk use. The only thought I do have is that it
is at least possible that you have something marginal in your
hardware, possibly memory or a controller. Two things which might be
useful are to check the memory (memtest) and to use 'sensors' to
monitor heat. I have seen drives which worked fine until you ran them
hard for 20-30 minutes and then started getting errors (usually seek).
Just a few things to consider, since you have put this much effort into
characterizing the problem.
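
For the heat check, a minimal sketch of one way to keep a timestamped
record of 'sensors' output while the drives are being worked hard, so
that any errors in dmesg can be lined up against temperatures. The
function name stamp, the 30 second interval, and the log path are all
assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: prefix every line read from stdin with the
# current date and time.
stamp() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}

# Example loop, left commented out: sample 'sensors' (from the
# lm-sensors package) every 30 seconds. The log path is an assumption.
# while sleep 30; do sensors | stamp; done >> /var/tmp/sensors.log
```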

--
Bill Davidsen <davidsen@xxxxxxx>
 Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a
normal user and is setuid root, with the "vi" line edit mode selected,
and the character set is "big5," an off-by-one error occurs during
wildcard (glob) expansion.
