Re[2]: raid5: cannot start dirty degraded array

Rainer Fuegenstein <rfu@xxxxxxxxxxxxxxxxxxxxxxxx> · Wed, 23 Dec 2009 14:44:02 +0100

MB> Give the output of these:
MB> mdadm -E /dev/sd[a-z]

]# mdadm -E /dev/sd[a-z]
mdadm: No md superblock detected on /dev/sda.
mdadm: No md superblock detected on /dev/sdb.
mdadm: No md superblock detected on /dev/sdc.
mdadm: No md superblock detected on /dev/sdd.

I assume that's not a good sign ?!
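
One thing that may matter here: the boot log quoted further down names the array members as partitions (sda1, sdb1, sdc1, sdd1), not whole disks, so the superblocks may simply live on the partitions. A minimal sketch, assuming the "md: bind<...>" lines list every member (the excerpt and the helper name are only for illustration), that turns those log lines into the examine command:

```python
import re

# Excerpt of the boot log quoted further down in this mail.
KERNEL_LOG = """\
md: bind<sda1>
md: bind<sdb1>
md: bind<sdc1>
md: bind<sdd1>
md: kicking non-fresh sda1 from array!
"""


def member_devices(log_text):
    """Return sorted /dev paths for every device named in an 'md: bind<...>' line."""
    names = re.findall(r"md: bind<(\w+)>", log_text)
    return sorted({"/dev/" + name for name in names})


members = member_devices(KERNEL_LOG)
# Build the command line only; nothing is executed here.
command = "mdadm -E " + " ".join(members)
print(command)
```

This only prints the command, so it can be reviewed before anything is run against the disks.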

sda was powered on and running after the reboot; a smartctl short test
revealed no errors, and the smartctl -a output also looks unsuspicious
(see below). the drives are rather new.

I guess it's more likely to be a problem with either the power supply
(400W) or the communication between controller and disk.

/dev/sdd (before it was replaced) reported the following:

Dec 20 07:18:54 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
Dec 20 07:48:53 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
Dec 20 08:18:54 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
Dec 20 08:48:55 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
Dec 20 09:18:53 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
Dec 20 09:48:58 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
Dec 20 10:19:01 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
Dec 20 10:48:54 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors

(which triggered a re-sync of the array)

# smartctl -a /dev/sda
smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD15EADS-00R6B0
Serial Number:    WD-WCAUP0017818
Firmware Version: 01.00A01
User Capacity:    1,500,301,910,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Wed Dec 23 14:40:46 2009 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (40800) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   177   145   021    Pre-fail  Always       -       8133
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       5272
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       14
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       2
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       13
194 Temperature_Celsius     0x0022   125   109   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      5272         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
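
To check several disks the same way, the attribute table can be scanned mechanically. A minimal sketch, assuming the raw value is the tenth whitespace-separated column (as in the output above) and that attributes 5, 197, 198 and 199 are the ones worth watching; that selection and the function names are my own choice, not part of smartctl:

```python
# Attribute IDs to watch for media or cabling trouble (my selection, an assumption).
WATCHED_IDS = {5, 197, 198, 199}

# The four watched rows, copied from the smartctl -a output for /dev/sda above.
SMART_TABLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""


def parse_attributes(table_text):
    """Map attribute ID -> (name, raw value) for rows with a plain integer raw value."""
    rows = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Skip headers, blank lines and rows whose raw value is not a plain integer.
        if len(fields) < 10 or not fields[0].isdigit() or not fields[9].isdigit():
            continue
        rows[int(fields[0])] = (fields[1], int(fields[9]))
    return rows


def flag_suspicious(rows):
    """Return the watched attributes whose raw value is greater than zero."""
    return {attr_id: value for attr_id, value in rows.items()
            if attr_id in WATCHED_IDS and value[1] > 0}


flagged = flag_suspicious(parse_attributes(SMART_TABLE))
print(flagged or "nothing flagged")
```

For sda nothing is flagged, which matches the impression from reading the table by eye.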

MB> From the errors you show, it seems like one of the disks is dead (sda)
MB> or dying. It could be just a bad PCB (the controller board of the
MB> disk) as it refuses to return SMART data, so you might be able to
MB> rescue data by changing the PCB, if it's that important to have that
MB> disk.

MB> As for the array, you can run a degraded array by force assembling it:
MB> mdadm -Af /dev/md0
MB> In the command above, mdadm will search on existing disks and
MB> partitions, which of them belongs to an array and assemble that array,
MB> if possible.
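
For reference, the "RAID5 conf printout" in the boot log quoted below shows which slot a forced assembly would have to run without. A minimal sketch, assuming rd/wd/fd stand for raid/working/failed disks and that o:1 marks an operational member (that is my reading of the kernel output, not something stated in this thread):

```python
import re

# Conf printout copied from the boot log quoted below (syslog prefix removed).
CONF_PRINTOUT = """\
 --- rd:4 wd:3 fd:1
 disk 0, o:1, dev:sdb1
 disk 1, o:1, dev:sdd1
 disk 3, o:1, dev:sdc1
"""


def summarize(printout):
    """Return disk counts, operational slots and the slots that have no member."""
    counts = re.search(r"rd:(\d+) wd:(\d+) fd:(\d+)", printout)
    raid_disks, working, failed = (int(value) for value in counts.groups())
    # Slot number -> device name for every member reported as operational.
    present = {int(slot): dev
               for slot, dev in re.findall(r"disk (\d+), o:1, dev:(\w+)", printout)}
    missing = [slot for slot in range(raid_disks) if slot not in present]
    return {"raid_disks": raid_disks, "working": working, "failed": failed,
            "present": present, "missing": missing}


summary = summarize(CONF_PRINTOUT)
print(summary)
```

Here slot 2 is the one without a member, which is the slot sda1 held in the earlier printout.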

MB> I also suggest you install smartmontools package and run smartctl -a
MB> /dev/sd[a-z] and see the report for each disk to make sure you don't
MB> have bad sectors or bad cables (CRC/ATA read errors) on any of the
MB> disks.

MB> On Wed, Dec 23, 2009 at 3:50 PM, Rainer Fuegenstein
MB> <rfu@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>> addendum: when going through the logs I found the reason:
>>
>> Dec 23 02:55:40 alfred kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>> Dec 23 02:55:40 alfred kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>> Dec 23 02:55:40 alfred kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> Dec 23 02:55:40 alfred kernel: ata1.00: status: { DRDY }
>> Dec 23 02:55:45 alfred kernel: ata1: link is slow to respond, please be patient (ready=0)
>> Dec 23 02:55:50 alfred kernel: ata1: device not ready (errno=-16), forcing hardreset
>> Dec 23 02:55:50 alfred kernel: ata1: soft resetting link
>> Dec 23 02:55:55 alfred kernel: ata1: link is slow to respond, please be patient (ready=0)
>> Dec 23 02:56:00 alfred kernel: ata1: SRST failed (errno=-16)
>> Dec 23 02:56:00 alfred kernel: ata1: soft resetting link
>> Dec 23 02:56:05 alfred kernel: ata1: link is slow to respond, please be patient (ready=0)
>> Dec 23 02:56:10 alfred kernel: ata1: SRST failed (errno=-16)
>> Dec 23 02:56:10 alfred kernel: ata1: soft resetting link
>> Dec 23 02:56:15 alfred kernel: ata1: link is slow to respond, please be patient (ready=0)
>> Dec 23 02:56:45 alfred kernel: ata1: SRST failed (errno=-16)
>> Dec 23 02:56:45 alfred kernel: ata1: limiting SATA link speed to 1.5 Gbps
>> Dec 23 02:56:45 alfred kernel: ata1: soft resetting link
>> Dec 23 02:56:50 alfred kernel: ata1: SRST failed (errno=-16)
>> Dec 23 02:56:50 alfred kernel: ata1: reset failed, giving up
>> Dec 23 02:56:50 alfred kernel: ata1.00: disabled
>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: timing out command, waited 30s
>> Dec 23 02:56:50 alfred kernel: ata1: EH complete
>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: SCSI error: return code = 0x00040000
>> Dec 23 02:56:50 alfred kernel: end_request: I/O error, dev sda, sector 1244700223
>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: SCSI error: return code = 0x00040000
>> Dec 23 02:56:50 alfred kernel: end_request: I/O error, dev sda, sector 1554309191
>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: SCSI error: return code = 0x00040000
>> Dec 23 02:56:50 alfred kernel: end_request: I/O error, dev sda, sector 1554309439
>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: SCSI error: return code = 0x00040000
>> Dec 23 02:56:50 alfred kernel: end_request: I/O error, dev sda, sector 572721343
>> Dec 23 02:56:50 alfred kernel: raid5: Disk failure on sda1, disabling device. Operation continuing on 3 devices
>> Dec 23 02:56:50 alfred kernel: RAID5 conf printout:
>> Dec 23 02:56:50 alfred kernel:  --- rd:4 wd:3 fd:1
>> Dec 23 02:56:50 alfred kernel:  disk 0, o:1, dev:sdb1
>> Dec 23 02:56:50 alfred kernel:  disk 1, o:1, dev:sdd1
>> Dec 23 02:56:50 alfred kernel:  disk 2, o:0, dev:sda1
>> Dec 23 02:56:50 alfred kernel:  disk 3, o:1, dev:sdc1
>> Dec 23 02:56:50 alfred kernel: RAID5 conf printout:
>> Dec 23 02:56:50 alfred kernel:  --- rd:4 wd:3 fd:1
>> Dec 23 02:56:50 alfred kernel:  disk 0, o:1, dev:sdb1
>> Dec 23 02:56:50 alfred kernel:  disk 1, o:1, dev:sdd1
>> Dec 23 02:56:50 alfred kernel:  disk 3, o:1, dev:sdc1
>> Dec 23 03:22:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check
>> Dec 23 03:22:57 alfred smartd[2692]: Sending warning via mail to root ...
>> Dec 23 03:22:58 alfred smartd[2692]: Warning via mail to root: successful
>> Dec 23 03:22:58 alfred smartd[2692]: Device: /dev/sda, failed to read SMART Attribute Data
>> Dec 23 03:22:58 alfred smartd[2692]: Sending warning via mail to root ...
>> Dec 23 03:22:58 alfred smartd[2692]: Warning via mail to root: successful
>> Dec 23 03:52:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check
>> Dec 23 03:52:57 alfred smartd[2692]: Device: /dev/sda, failed to read SMART Attribute Data
>> Dec 23 04:22:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check
>> Dec 23 04:22:57 alfred smartd[2692]: Device: /dev/sda, failed to read SMART Attribute Data
>> Dec 23 04:52:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check
>>  [...]
>> Dec 23 09:52:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check
>> Dec 23 09:52:57 alfred smartd[2692]: Device: /dev/sda, failed to read SMART Attribute Data
>>  (crash here)
>>
>>
>> RF> hi,
>>
>> RF> got a "nice" early christmas present this morning: after a crash, the raid5
>> RF> (consisting of 4*1.5TB WD caviar green SATA disks) won't start :-(
>>
>> RF> the history:
>> RF> sometimes, the raid kicked out one disk, started a resync (which
>> RF> lasted for about 3 days) and was fine after that. a few days ago I
>> RF> replaced drive sdd (which seemed to cause the troubles) and synced the
>> RF> raid again which finished yesterday in the early afternoon. at 10am
>> RF> today the system crashed and the raid won't start:
>>
>> RF> OS is Centos 5
>> RF> mdadm - v2.6.9 - 10th March 2009
>> RF> Linux alfred 2.6.18-164.6.1.el5xen #1 SMP Tue Nov 3 17:53:47 EST 2009 i686 athlon i386 GNU/Linux
>>
>>
>> RF> Dec 23 12:30:19 alfred kernel: md: Autodetecting RAID arrays.
>> RF> Dec 23 12:30:19 alfred kernel: md: autorun ...
>> RF> Dec 23 12:30:19 alfred kernel: md: considering sdd1 ...
>> RF> Dec 23 12:30:19 alfred kernel: md:  adding sdd1 ...
>> RF> Dec 23 12:30:19 alfred kernel: md:  adding sdc1 ...
>> RF> Dec 23 12:30:19 alfred kernel: md:  adding sdb1 ...
>> RF> Dec 23 12:30:19 alfred kernel: md:  adding sda1 ...
>> RF> Dec 23 12:30:19 alfred kernel: md: created md0
>> RF> Dec 23 12:30:19 alfred kernel: md: bind<sda1>
>> RF> Dec 23 12:30:19 alfred kernel: md: bind<sdb1>
>> RF> Dec 23 12:30:19 alfred kernel: md: bind<sdc1>
>> RF> Dec 23 12:30:19 alfred kernel: md: bind<sdd1>
>> RF> Dec 23 12:30:19 alfred kernel: md: running: <sdd1><sdc1><sdb1><sda1>
>> RF> Dec 23 12:30:19 alfred kernel: md: kicking non-fresh sda1 from array!
>> RF> Dec 23 12:30:19 alfred kernel: md: unbind<sda1>
>> RF> Dec 23 12:30:19 alfred kernel: md: export_rdev(sda1)
>> RF> Dec 23 12:30:19 alfred kernel: md: md0: raid array is not clean -- starting background reconstruction
>> RF>     (no reconstruction is actually started, disks are idle)
>> RF> Dec 23 12:30:19 alfred kernel: raid5: automatically using best checksumming function: pIII_sse
>> RF> Dec 23 12:30:19 alfred kernel:    pIII_sse  :  7085.000 MB/sec
>> RF> Dec 23 12:30:19 alfred kernel: raid5: using function: pIII_sse (7085.000 MB/sec)
>> RF> Dec 23 12:30:19 alfred kernel: raid6: int32x1    896 MB/s
>> RF> Dec 23 12:30:19 alfred kernel: raid6: int32x2    972 MB/s
>> RF> Dec 23 12:30:19 alfred kernel: raid6: int32x4    893 MB/s
>> RF> Dec 23 12:30:19 alfred kernel: raid6: int32x8    934 MB/s
>> RF> Dec 23 12:30:19 alfred kernel: raid6: mmxx1     1845 MB/s
>> RF> Dec 23 12:30:19 alfred kernel: raid6: mmxx2     3250 MB/s
>> RF> Dec 23 12:30:19 alfred kernel: raid6: sse1x1    1799 MB/s
>> RF> Dec 23 12:30:19 alfred kernel: raid6: sse1x2    3067 MB/s
>> RF> Dec 23 12:30:19 alfred kernel: raid6: sse2x1    2980 MB/s
>> RF> Dec 23 12:30:19 alfred kernel: raid6: sse2x2    4015 MB/s
>> RF> Dec 23 12:30:19 alfred kernel: raid6: using algorithm sse2x2 (4015 MB/s)
>> RF> Dec 23 12:30:19 alfred kernel: md: raid6 personality registered for level 6
>> RF> Dec 23 12:30:19 alfred kernel: md: raid5 personality registered for level 5
>> RF> Dec 23 12:30:19 alfred kernel: md: raid4 personality registered for level 4
>> RF> Dec 23 12:30:19 alfred kernel: raid5: device sdd1 operational as raid disk 1
>> RF> Dec 23 12:30:19 alfred kernel: raid5: device sdc1 operational as raid disk 3
>> RF> Dec 23 12:30:19 alfred kernel: raid5: device sdb1 operational as raid disk 0
>> RF> Dec 23 12:30:19 alfred kernel: raid5: cannot start dirty degraded array for md0
>> RF> Dec 23 12:30:19 alfred kernel: RAID5 conf printout:
>> RF> Dec 23 12:30:19 alfred kernel:  --- rd:4 wd:3 fd:1
>> RF> Dec 23 12:30:19 alfred kernel:  disk 0, o:1, dev:sdb1
>> RF> Dec 23 12:30:19 alfred kernel:  disk 1, o:1, dev:sdd1
>> RF> Dec 23 12:30:19 alfred kernel:  disk 3, o:1, dev:sdc1
>> RF> Dec 23 12:30:19 alfred kernel: raid5: failed to run raid set md0
>> RF> Dec 23 12:30:19 alfred kernel: md: pers->run() failed ...
>> RF> Dec 23 12:30:19 alfred kernel: md: do_md_run() returned -5
>> RF> Dec 23 12:30:19 alfred kernel: md: md0 stopped.
>> RF> Dec 23 12:30:19 alfred kernel: md: unbind<sdd1>
>> RF> Dec 23 12:30:19 alfred kernel: md: export_rdev(sdd1)
>> RF> Dec 23 12:30:19 alfred kernel: md: unbind<sdc1>
>> RF> Dec 23 12:30:19 alfred kernel: md: export_rdev(sdc1)
>> RF> Dec 23 12:30:19 alfred kernel: md: unbind<sdb1>
>> RF> Dec 23 12:30:19 alfred kernel: md: export_rdev(sdb1)
>> RF> Dec 23 12:30:19 alfred kernel: md: ... autorun DONE.
>> RF> Dec 23 12:30:19 alfred kernel: device-mapper: multipath: version 1.0.5 loaded
>>
>> RF> # cat /proc/mdstat
>> RF> Personalities : [raid6] [raid5] [raid4]
>> RF> unused devices: <none>
>>
>> RF> filesystem used on top of md0 is xfs.
>>
>> RF> please advise what to do next and let me know if you need further
>> RF> information. really don't want to lose 3TB worth of data :-(
>>
>>
>> RF> tnx in advance.
>>
>> RF> --
>> RF> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> RF> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> RF> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>> ------------------------------------------------------------------------------
>> Unix gives you just enough rope to hang yourself -- and then a couple of more
>> feet, just to be sure.
>> (Eric Allman)
>> ------------------------------------------------------------------------------
>>
