Re: Any hope for a 27 disk RAID6+1HS array with four disks reporting "No md superblock detected"?

Thomas Baker <tjb@xxxxxxx> · Thu, 5 Feb 2009 19:08:09 -0500

On Feb 5, 2009, at 6:57 PM, Bill Davidsen wrote:

Thomas J. Baker wrote:
On Thu, 2009-02-05 at 13:49 -0500, Bill Davidsen wrote:

Thomas J. Baker wrote:

The array was made probably two years ago and had been working fine
until recently. In reading the documentation for mdadm, it did seem like
it should have required me to use the higher version, but it never
complained when I made it and it worked fine.

What have you changed lately? Are the drives all on a single  
controller? Are you using PARTITIONS in mdadm.conf and letting  
mdadm find things for itself?

The array is made up of two Dell PowerVault 220s in split bus
configuration with two Adaptec 39160 Dual Channel SCSI controllers.
Each half of each PowerVault (7 disks) is connected to one of the
channels on the Adaptecs. Four channels in all.
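
For reference, the layout described here can be tallied in a few lines.
This is only a back-of-the-envelope sketch: the 27+1 split comes from
the subject line, the two-failure tolerance is the general RAID6
property, and the variable names are illustrative. It also assumes all
four disks without a superblock are active members; even if one of them
were the hot spare, three missing members would still exceed what RAID6
tolerates.

```python
# Tally of the disk layout described in this thread (names are illustrative).
enclosures = 2            # Dell PowerVault 220s
halves_per_enclosure = 2  # split-bus configuration
disks_per_half = 7

channels = enclosures * halves_per_enclosure   # one SCSI channel per half
total_disks = channels * disks_per_half

hot_spares = 1
raid6_members = total_disks - hot_spares       # the "27 disk RAID6+1HS"

raid6_parity = 2                               # RAID6 tolerates two missing members
data_disks = raid6_members - raid6_parity
missing_members = 4                            # assumption: all four are active members

can_start_degraded = missing_members <= raid6_parity
print(channels, total_disks, raid6_members, data_disks, can_start_degraded)
# prints: 4 28 27 25 False
```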

As far as changing things, what do you mean? The cause of the failure
is likely heat as we've had some AC issues recently.

Well, that's a change, but if you can read the drives at all it doesn't
sound like the typical "fall down dead" heat issue; I would expect
tons of hardware errors at a lower level from the device controller.
Did you check the partition tables with fdisk or similar? Are the
drives all in the same physical box? IBM split their boxes, running
four drives off one power supply and four (or three+CD) off the other.
The four drives are likely to have something in common; if you can
find it you might fix it.

I didn't use mdadm.conf at all. All disks are partitioned with one
'Linux raid autodetect' partition.  mdadm had always found the array
automatically at boot.

No kernel update or utilities update lately?

Given the choice between identifying the cause in hopes of a fixable
problem, or reinstalling, reconfiguring, and recovering from backup,
I'm trying to see if you can do the former in preference to the latter.

Fdisk reports all drives look OK as far as partition table and
partition type. I'm in the process of running a media verify from the
Adaptec BIOS on each of the four to make sure nothing is really wrong
with them. Each PowerVault houses 14 drives, so we have two boxes. A
PowerVault is just a box for disks, essentially an external SCSI
enclosure. As far as I can tell, the hardware seems fine now that the
AC is fixed.
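
For anyone who wants to repeat the partition-type check without fdisk,
here is a minimal read-only sketch. It assumes a classic MBR partition
table (four 16-byte entries at byte 446, type byte at offset 4 of each
entry, 0x55 0xAA signature at byte 510) and looks for type 0xFD, "Linux
raid autodetect". The function name is my own and the input below is a
synthetic sector, not data from these disks.

```python
import io

SECTOR_SIZE = 512
PART_TABLE_OFFSET = 446     # four 16-byte entries start here in an MBR
RAID_AUTODETECT = 0xFD      # partition type "Linux raid autodetect"


def mbr_partition_types(dev):
    """Return the type bytes of the four primary MBR entries, or None if
    the first sector lacks the 0x55 0xAA boot signature."""
    dev.seek(0)
    sector = dev.read(SECTOR_SIZE)
    if len(sector) < SECTOR_SIZE or sector[510:512] != b"\x55\xaa":
        return None
    return [sector[PART_TABLE_OFFSET + 16 * i + 4] for i in range(4)]


# Synthetic example: a blank sector with one 0xFD entry, not a real disk.
fake = bytearray(SECTOR_SIZE)
fake[PART_TABLE_OFFSET + 4] = RAID_AUTODETECT
fake[510:512] = b"\x55\xaa"
types = mbr_partition_types(io.BytesIO(bytes(fake)))
print([hex(t) for t in types])
# prints: ['0xfd', '0x0', '0x0', '0x0']
```

On a real system the same function could be given a whole-disk device
opened with open(path, "rb"), which needs root and only reads.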

I did do a software update after the failure in hopes of it helping,
which likely updated the kernel since it had been a month or two since
the last update on that machine. It's CentOS 5, so nothing major should
have changed in terms of versions, etc.

The research group that uses the array is hoping for a fixable problem
too, as opposed to the longer remake/restore route. The only hope, as
far as I can see, is that mdadm can somehow recover or remake the md
superblock on the four troublesome disks.
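
To make "no md superblock" concrete, here is a rough read-only sketch
of where a version 0.90 superblock would be expected and how its magic
number could be checked. It assumes 0.90 metadata (nothing in this
thread confirms the version), a little-endian host such as x86, and
that the device handed in is the RAID partition rather than the whole
disk. The function names are mine. It only detects a superblock; it
does not repair anything.

```python
import io
import struct

MD_RESERVED_BYTES = 64 * 1024   # 0.90 superblock lives in a reserved 64 KiB block
MD_SB_MAGIC = 0xA92B4EFC        # magic number at the start of an md superblock


def sb090_offset(device_bytes):
    """Byte offset of the 0.90 superblock: round the device size down to
    a 64 KiB boundary, then step back one more 64 KiB block."""
    return (device_bytes & ~(MD_RESERVED_BYTES - 1)) - MD_RESERVED_BYTES


def has_sb090_magic(dev):
    """True if the expected 0.90 location starts with the md magic number.
    Assumes the superblock was written on a little-endian host."""
    device_bytes = dev.seek(0, io.SEEK_END)
    offset = sb090_offset(device_bytes)
    if offset < 0:              # device too small to hold a superblock
        return False
    dev.seek(offset)
    raw = dev.read(4)
    return len(raw) == 4 and struct.unpack("<I", raw)[0] == MD_SB_MAGIC


# Synthetic images standing in for component devices, not real disks.
size = 200_000
image = bytearray(size)
off = sb090_offset(size)
image[off:off + 4] = struct.pack("<I", MD_SB_MAGIC)

with_sb = has_sb090_magic(io.BytesIO(bytes(image)))
without_sb = has_sb090_magic(io.BytesIO(bytes(size)))
print(off, with_sb, without_sb)
# prints: 131072 True False
```

If the magic number is present on a disk that mdadm rejects, the
superblock is probably damaged rather than gone, which is a different
situation from a disk where that location reads back as zeros.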

Thanks,

tjb
--
=======================================================================
| Thomas Baker                                  email: tjb@xxxxxxx    |
| Systems Programmer                                                  |
| Research Computing Center                     voice: (603) 862-4490 |
| University of New Hampshire                     fax: (603) 862-1761 |
| 332 Morse Hall                                                      |
| Durham, NH 03824 USA              http://wintermute.sr.unh.edu/~tjb |
=======================================================================
