MB> Give the output of these:
MB> mdadm -E /dev/sd[a-z]

# mdadm -E /dev/sd[a-z]
mdadm: No md superblock detected on /dev/sda.
mdadm: No md superblock detected on /dev/sdb.
mdadm: No md superblock detected on /dev/sdc.
mdadm: No md superblock detected on /dev/sdd.

I assume that's not a good sign?!

sda was powered on and running after the reboot, a smartctl short test
revealed no errors, and smartctl -a also looks unsuspicious (see below).
the drives are rather new. guess it's more likely to be a problem with
either the power supply (400W) or the communication between controller
and disk.

/dev/sdd (before it was replaced) reported the following:

Dec 20 07:18:54 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
Dec 20 07:48:53 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
Dec 20 08:18:54 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
Dec 20 08:48:55 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
Dec 20 09:18:53 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
Dec 20 09:48:58 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
Dec 20 10:19:01 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
Dec 20 10:48:54 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors

(which triggered a re-sync of the array)

# smartctl -a /dev/sda
smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD15EADS-00R6B0
Serial Number:    WD-WCAUP0017818
Firmware Version: 01.00A01
User Capacity:    1,500,301,910,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Wed Dec 23 14:40:46 2009 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (40800) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   177   145   021    Pre-fail  Always       -       8133
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       5272
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       14
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       2
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       13
194 Temperature_Celsius     0x0022   125   109   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      5272         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

MB> From the errors you show, it seems like one of the disks is dead (sda)
MB> or dying.
MB> It could be just a bad PCB (the controller board of the
MB> disk) as it refuses to return SMART data, so you might be able to
MB> rescue data by changing the PCB, if it's that important to have that
MB> disk.

MB> As for the array, you can run a degraded array by force assembling it:
MB> mdadm -Af /dev/md0
MB> In the command above, mdadm will search on existing disks and
MB> partitions, which of them belongs to an array and assemble that array,
MB> if possible.

MB> I also suggest you install smartmontools package and run smartctl -a
MB> /dev/sd[a-z] and see the report for each disk to make sure you don't
MB> have bad sectors or bad cables (CRC/ATA read errors) on any of the
MB> disks.

MB> On Wed, Dec 23, 2009 at 3:50 PM, Rainer Fuegenstein
MB> <rfu@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>> addendum: when going through the logs I found the reason:
>>
>> Dec 23 02:55:40 alfred kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>> Dec 23 02:55:40 alfred kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>> Dec 23 02:55:40 alfred kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> Dec 23 02:55:40 alfred kernel: ata1.00: status: { DRDY }
>> Dec 23 02:55:45 alfred kernel: ata1: link is slow to respond, please be patient (ready=0)
>> Dec 23 02:55:50 alfred kernel: ata1: device not ready (errno=-16), forcing hardreset
>> Dec 23 02:55:50 alfred kernel: ata1: soft resetting link
>> Dec 23 02:55:55 alfred kernel: ata1: link is slow to respond, please be patient (ready=0)
>> Dec 23 02:56:00 alfred kernel: ata1: SRST failed (errno=-16)
>> Dec 23 02:56:00 alfred kernel: ata1: soft resetting link
>> Dec 23 02:56:05 alfred kernel: ata1: link is slow to respond, please be patient (ready=0)
>> Dec 23 02:56:10 alfred kernel: ata1: SRST failed (errno=-16)
>> Dec 23 02:56:10 alfred kernel: ata1: soft resetting link
>> Dec 23 02:56:15 alfred kernel: ata1: link is slow to respond, please be patient (ready=0)
>> Dec 23 02:56:45 alfred kernel: ata1: SRST failed (errno=-16)
>> Dec 23 02:56:45 alfred kernel: ata1: limiting SATA link speed to 1.5 Gbps
>> Dec 23 02:56:45 alfred kernel: ata1: soft resetting link
>> Dec 23 02:56:50 alfred kernel: ata1: SRST failed (errno=-16)
>> Dec 23 02:56:50 alfred kernel: ata1: reset failed, giving up
>> Dec 23 02:56:50 alfred kernel: ata1.00: disabled
>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: timing out command, waited 30s
>> Dec 23 02:56:50 alfred kernel: ata1: EH complete
>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: SCSI error: return code = 0x00040000
>> Dec 23 02:56:50 alfred kernel: end_request: I/O error, dev sda, sector 1244700223
>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: SCSI error: return code = 0x00040000
>> Dec 23 02:56:50 alfred kernel: end_request: I/O error, dev sda, sector 1554309191
>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: SCSI error: return code = 0x00040000
>> Dec 23 02:56:50 alfred kernel: end_request: I/O error, dev sda, sector 1554309439
>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: SCSI error: return code = 0x00040000
>> Dec 23 02:56:50 alfred kernel: end_request: I/O error, dev sda, sector 572721343
>> Dec 23 02:56:50 alfred kernel: raid5: Disk failure on sda1, disabling device. Operation continuing on 3 devices
>> Dec 23 02:56:50 alfred kernel: RAID5 conf printout:
>> Dec 23 02:56:50 alfred kernel: --- rd:4 wd:3 fd:1
>> Dec 23 02:56:50 alfred kernel: disk 0, o:1, dev:sdb1
>> Dec 23 02:56:50 alfred kernel: disk 1, o:1, dev:sdd1
>> Dec 23 02:56:50 alfred kernel: disk 2, o:0, dev:sda1
>> Dec 23 02:56:50 alfred kernel: disk 3, o:1, dev:sdc1
>> Dec 23 02:56:50 alfred kernel: RAID5 conf printout:
>> Dec 23 02:56:50 alfred kernel: --- rd:4 wd:3 fd:1
>> Dec 23 02:56:50 alfred kernel: disk 0, o:1, dev:sdb1
>> Dec 23 02:56:50 alfred kernel: disk 1, o:1, dev:sdd1
>> Dec 23 02:56:50 alfred kernel: disk 3, o:1, dev:sdc1
>> Dec 23 03:22:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check
>> Dec 23 03:22:57 alfred smartd[2692]: Sending warning via mail to root ...
>> Dec 23 03:22:58 alfred smartd[2692]: Warning via mail to root: successful
>> Dec 23 03:22:58 alfred smartd[2692]: Device: /dev/sda, failed to read SMART Attribute Data
>> Dec 23 03:22:58 alfred smartd[2692]: Sending warning via mail to root ...
>> Dec 23 03:22:58 alfred smartd[2692]: Warning via mail to root: successful
>> Dec 23 03:52:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check
>> Dec 23 03:52:57 alfred smartd[2692]: Device: /dev/sda, failed to read SMART Attribute Data
>> Dec 23 04:22:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check
>> Dec 23 04:22:57 alfred smartd[2692]: Device: /dev/sda, failed to read SMART Attribute Data
>> Dec 23 04:52:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check
>> [...]
>> Dec 23 09:52:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check
>> Dec 23 09:52:57 alfred smartd[2692]: Device: /dev/sda, failed to read SMART Attribute Data
>> (crash here)
>>
>>
>> RF> hi,
>>
>> RF> got a "nice" early christmas present this morning: after a crash, the raid5
>> RF> (consisting of 4*1.5TB WD caviar green SATA disks) won't start :-(
>>
>> RF> the history:
>> RF> sometimes, the raid kicked out one disk, started a resync (which
>> RF> lasted for about 3 days) and was fine after that. a few days ago I
>> RF> replaced drive sdd (which seemed to cause the troubles) and synced the
>> RF> raid again which finished yesterday in the early afternoon. at 10am
>> RF> today the system crashed and the raid won't start:
>>
>> RF> OS is Centos 5
>> RF> mdadm - v2.6.9 - 10th March 2009
>> RF> Linux alfred 2.6.18-164.6.1.el5xen #1 SMP Tue Nov 3 17:53:47 EST 2009 i686 athlon i386 GNU/Linux
>>
>>
>> RF> Dec 23 12:30:19 alfred kernel: md: Autodetecting RAID arrays.
>> RF> Dec 23 12:30:19 alfred kernel: md: autorun ...
>> RF> Dec 23 12:30:19 alfred kernel: md: considering sdd1 ...
>> RF> Dec 23 12:30:19 alfred kernel: md: adding sdd1 ...
>> RF> Dec 23 12:30:19 alfred kernel: md: adding sdc1 ...
>> RF> Dec 23 12:30:19 alfred kernel: md: adding sdb1 ...
>> RF> Dec 23 12:30:19 alfred kernel: md: adding sda1 ...
>> RF> Dec 23 12:30:19 alfred kernel: md: created md0
>> RF> Dec 23 12:30:19 alfred kernel: md: bind<sda1>
>> RF> Dec 23 12:30:19 alfred kernel: md: bind<sdb1>
>> RF> Dec 23 12:30:19 alfred kernel: md: bind<sdc1>
>> RF> Dec 23 12:30:19 alfred kernel: md: bind<sdd1>
>> RF> Dec 23 12:30:19 alfred kernel: md: running: <sdd1><sdc1><sdb1><sda1>
>> RF> Dec 23 12:30:19 alfred kernel: md: kicking non-fresh sda1 from array!
>> RF> Dec 23 12:30:19 alfred kernel: md: unbind<sda1>
>> RF> Dec 23 12:30:19 alfred kernel: md: export_rdev(sda1)
>> RF> Dec 23 12:30:19 alfred kernel: md: md0: raid array is not clean -- starting background reconstruction
>> RF> (no reconstruction is actually started, disks are idle)
>> RF> Dec 23 12:30:19 alfred kernel: raid5: automatically using best checksumming function: pIII_sse
>> RF> Dec 23 12:30:19 alfred kernel: pIII_sse : 7085.000 MB/sec
>> RF> Dec 23 12:30:19 alfred kernel: raid5: using function: pIII_sse (7085.000 MB/sec)
>> RF> Dec 23 12:30:19 alfred kernel: raid6: int32x1 896 MB/s
>> RF> Dec 23 12:30:19 alfred kernel: raid6: int32x2 972 MB/s
>> RF> Dec 23 12:30:19 alfred kernel: raid6: int32x4 893 MB/s
>> RF> Dec 23 12:30:19 alfred kernel: raid6: int32x8 934 MB/s
>> RF> Dec 23 12:30:19 alfred kernel: raid6: mmxx1 1845 MB/s
>> RF> Dec 23 12:30:19 alfred kernel: raid6: mmxx2 3250 MB/s
>> RF> Dec 23 12:30:19 alfred kernel: raid6: sse1x1 1799 MB/s
>> RF> Dec 23 12:30:19 alfred kernel: raid6: sse1x2 3067 MB/s
>> RF> Dec 23 12:30:19 alfred kernel: raid6: sse2x1 2980 MB/s
>> RF> Dec 23 12:30:19 alfred kernel: raid6: sse2x2 4015 MB/s
>> RF> Dec 23 12:30:19 alfred kernel: raid6: using algorithm sse2x2 (4015 MB/s)
>> RF> Dec 23 12:30:19 alfred kernel: md: raid6 personality registered for level 6
>> RF> Dec 23 12:30:19 alfred kernel: md: raid5 personality registered for level 5
>> RF> Dec 23 12:30:19 alfred kernel: md: raid4 personality registered for level 4
>> RF> Dec 23 12:30:19 alfred kernel: raid5: device sdd1 operational as raid disk 1
>> RF> Dec 23 12:30:19 alfred kernel: raid5: device sdc1 operational as raid disk 3
>> RF> Dec 23 12:30:19 alfred kernel: raid5: device sdb1 operational as raid disk 0
>> RF> Dec 23 12:30:19 alfred kernel: raid5: cannot start dirty degraded array for md0
>> RF> Dec 23 12:30:19 alfred kernel: RAID5 conf printout:
>> RF> Dec 23 12:30:19 alfred kernel: --- rd:4 wd:3 fd:1
>> RF> Dec 23 12:30:19 alfred kernel: disk 0, o:1, dev:sdb1
>> RF> Dec 23 12:30:19 alfred kernel: disk 1, o:1, dev:sdd1
>> RF> Dec 23 12:30:19 alfred kernel: disk 3, o:1, dev:sdc1
>> RF> Dec 23 12:30:19 alfred kernel: raid5: failed to run raid set md0
>> RF> Dec 23 12:30:19 alfred kernel: md: pers->run() failed ...
>> RF> Dec 23 12:30:19 alfred kernel: md: do_md_run() returned -5
>> RF> Dec 23 12:30:19 alfred kernel: md: md0 stopped.
>> RF> Dec 23 12:30:19 alfred kernel: md: unbind<sdd1>
>> RF> Dec 23 12:30:19 alfred kernel: md: export_rdev(sdd1)
>> RF> Dec 23 12:30:19 alfred kernel: md: unbind<sdc1>
>> RF> Dec 23 12:30:19 alfred kernel: md: export_rdev(sdc1)
>> RF> Dec 23 12:30:19 alfred kernel: md: unbind<sdb1>
>> RF> Dec 23 12:30:19 alfred kernel: md: export_rdev(sdb1)
>> RF> Dec 23 12:30:19 alfred kernel: md: ... autorun DONE.
>> RF> Dec 23 12:30:19 alfred kernel: device-mapper: multipath: version 1.0.5 loaded
>>
>> RF> # cat /proc/mdstat
>> RF> Personalities : [raid6] [raid5] [raid4]
>> RF> unused devices: <none>
>>
>> RF> filesystem used on top of md0 is xfs.
>>
>> RF> please advice what to do next and let me know if you need further
>> RF> information. really don't want to lose 3TB worth of data :-(
>>
>>
>> RF> tnx in advance.
>>
>> RF> --
>> RF> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> RF> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> RF> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
>> ------------------------------------------------------------------------------
>> Unix gives you just enough rope to hang yourself -- and then a couple of more
>> feet, just to be sure.
>> (Eric Allman)
>> ------------------------------------------------------------------------------
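
A note on the exchange above: the kernel log lists the array members as partitions (sda1, sdb1, sdc1, sdd1), while "mdadm -E /dev/sd[a-z]" examines the whole disks, which would likely explain the "No md superblock detected" messages. The following is a minimal sketch, assuming a syslog file that contains the md autodetect messages quoted in this thread. It extracts the operational and the kicked members and only prints candidate "mdadm --examine" and "mdadm --assemble --force" commands for review. The function names are hypothetical, /dev/md0 is taken from the log, and nothing is run against the disks.

```shell
#!/bin/sh
# Sketch only: derive member partitions from the md autodetect log and
# print (not run) the mdadm commands one might review next.
# Function names are hypothetical; the log patterns match the kernel
# messages quoted in this thread (2.6.18 raid5 driver).

# Partitions the kernel reported as operational raid5 members.
operational_members() {
    sed -n 's|.*raid5: device \([a-z0-9]*\) operational as raid disk.*|/dev/\1|p' | sort -u
}

# Partitions md kicked from the array as non-fresh.
kicked_members() {
    sed -n 's|.*md: kicking non-fresh \([a-z0-9]*\) from array.*|/dev/\1|p' | sort -u
}

# Print candidate commands for the log file given as $1.
# Nothing is executed against the disks.
suggest_commands() {
    log=$1
    good=$(operational_members < "$log" | paste -s -d ' ' -)
    stale=$(kicked_members < "$log" | paste -s -d ' ' -)
    # Examine the partitions (not the whole disks) to compare superblocks.
    echo "mdadm --examine $good $stale"
    # Force-assemble from the members the kernel still considered operational.
    echo "mdadm --assemble --force /dev/md0 $good"
}
```

Usage, assuming the messages were logged to /var/log/messages: source the file and run "suggest_commands /var/log/messages". It seems prudent to compare the event counts reported by --examine before forcing assembly, since forcing a dirty degraded array to start can leave the filesystem inconsistent.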