Re[10]: raid5: cannot start dirty degraded array

Rainer Fuegenstein <rfu@xxxxxxxxxxxxxxxxxxxxxxxx> · Wed, 23 Dec 2009 18:03:08 +0100

MB> Is the disk being kicked always on the same port? (port 1 for example)

not sure how to interpret the syslog messages:

Nov 28 21:24:40 alfred kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Nov 28 21:24:40 alfred kernel: ata2.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0
Nov 28 21:24:40 alfred kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 28 21:24:40 alfred kernel: ata2.00: status: { DRDY }
Nov 28 21:24:40 alfred kernel: ata2: soft resetting link
Nov 28 21:24:41 alfred kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Nov 28 21:24:41 alfred kernel: ata2.00: configured for UDMA/133
Nov 28 21:24:41 alfred kernel: ata2: EH complete
Nov 28 21:24:41 alfred kernel: SCSI device sdb: 2930277168 512-byte hdwr sectors (1500302 MB)
Nov 28 21:24:41 alfred kernel: sdb: Write Protect is off
Nov 28 21:24:41 alfred kernel: SCSI device sdb: drive cache: write back
Nov 28 21:24:41 alfred smartd[2770]: Device: /dev/sdd, 1 Offline uncorrectable sectors

the smartd message for sdd appears frequently, that's why I replaced
the drive. the timeout above occurred 3 times within the last month for
sdb. guess you are right with either the port or the cable.

tonight it was sda, but I might have disturbed the cable without
noticing when replacing sdd.
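One way to check whether it is always the same port is to count the exception lines per ataN port over a longer stretch of syslog. A minimal sketch in Python, assuming the kernel messages keep the "ataN.MM: exception" shape shown above; count_exceptions_per_port is a made-up helper name, and the mapping from ataN to sdX still has to be read from the neighbouring log lines (above, the ata2 reset is followed by sdb):

```python
import re
from collections import Counter

# Matches libata exception lines such as
# "Nov 28 21:24:40 alfred kernel: ata2.00: exception Emask 0x0 ..."
EXCEPTION_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: exception ")


def count_exceptions_per_port(lines):
    """Return a Counter mapping an ATA port name (e.g. 'ata2') to its exception count."""
    counts = Counter()
    for line in lines:
        match = EXCEPTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Sample lines copied from the log excerpts in this thread.
    sample = [
        "Nov 28 21:24:40 alfred kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen",
        "Nov 28 21:24:40 alfred kernel: ata2.00: status: { DRDY }",
        "Dec 23 02:55:40 alfred kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen",
    ]
    for port, n in sorted(count_exceptions_per_port(sample).items()):
        print(port, n)
```

Feeding it the lines of the real syslog file (for example via open(path) on whatever file the kernel messages go to) should show whether one port dominates or the errors wander between ports.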

MB> If so, then you may have a problem with that specific port. If it
MB> kicks disks randomly, and you're sure that your cables or disks are
MB> healthy, then it's probably time to change the motherboard.

I plan to move to the new atom/pinetrail mainboards as soon as they
are available in january. hope that solves this issue. but will check
the cable anyway.

tnx & cu

MB> Increasing the minimum resync speed (speed_limit_min) will slow down
MB> your server if you're trying to access it during a resync.
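To put the speed limits into perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the limits are in KiB/s per member device and that a resync sweeps each member once at a constant rate, which a real resync will not do exactly; the function names are made up for illustration, and the device size is the "Used Dev Size" from the mdadm -E output quoted further down.

```python
# Per-device size in KiB, taken from "Used Dev Size" in the mdadm -E output.
USED_DEV_SIZE_KIB = 1465135936


def resync_hours(dev_size_kib, speed_kib_per_s):
    """Hours needed to sweep one member device at a constant resync speed."""
    return dev_size_kib / speed_kib_per_s / 3600.0


def implied_speed_kib_per_s(dev_size_kib, hours):
    """Average speed (KiB/s) implied by an observed resync duration."""
    return dev_size_kib / (hours * 3600.0)


if __name__ == "__main__":
    for label, speed in (("speed_limit_min", 1000), ("speed_limit_max", 200000)):
        hours = resync_hours(USED_DEV_SIZE_KIB, speed)
        print(f"{label}={speed}: {hours:.1f} h")
    speed = implied_speed_kib_per_s(USED_DEV_SIZE_KIB, 72)
    print(f"3-day resync implies about {speed:.0f} KiB/s")
```

Under these assumptions a three-day resync works out to an average of only a few MB/s per disk, far below speed_limit_max, so the limit itself is probably not what is holding the resync back.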

MB> On Wed, Dec 23, 2009 at 6:13 PM, Rainer Fuegenstein
MB> <rfu@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> MB> I don't know why your array takes 3 days to resync. My array is 7TB in
>> MB> size (8x1TB @ RAID5) and it takes about 16 hours.
>>
>> that's definitely a big mystery. I put this to this list some time ago
>> when upgrading the same array from 4*750GB to 4*1500GB by replacing
>> one disk after the other and finally --growing the raid:
>>
>> 1st disk took just a few minutes
>> 2nd disk some hours
>> 3rd disk more than a day
>> 4th disk about 2+ days
>> --grow also took  2+ days
>>
>> MB> Check the value of this file:
>> MB> cat /proc/sys/dev/raid/speed_limit_max
>>
>> default values are:
>> [root@alfred cdrom]# cat /proc/sys/dev/raid/speed_limit_max
>> 200000
>> [root@alfred cdrom]# cat /proc/sys/dev/raid/speed_limit_min
>> 1000
>>
>> when resyncing (with these default values), the server becomes awfully
>> slow (streaming mp3 via smb suffers timeouts).
>>
>> mainboard is an Asus M2N with NFORCE-MCP61 chipset.
>>
>> this server started on an 800MHz asus board with 4*400 GB PATA disks
>> and had this one-disk-failure from the start (every few months). over the
>> years everything was replaced (power supply, mainboard, disks,
>> controller, pata to sata, ...) but it still kicks out disks (with the
>> current asus M2N board about every two to three weeks).
>>
>> must be cosmic radiation to blame ...
>>
>>
>> MB> Make it a high number so that when there's no process querying the
>> MB> disks, the resync process will go for the max speed.
>> MB> echo 200000 > /proc/sys/dev/raid/speed_limit_max
>> MB> (200 MB/s)
>>
>> MB> The file /proc/sys/dev/raid/speed_limit_min specifies the minimum
>> MB> speed at which the array should resync, even when there are other
>> MB> programs querying the disks.
>>
>> MB> Make sure you run the above changes just before you issue a resync.
>> MB> Changes are lost on reboot.
>>
>> MB> On Wed, Dec 23, 2009 at 5:30 PM, Rainer Fuegenstein
>> MB> <rfu@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>> tnx for the info, in the meantime I did:
>>>>
>>>> mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
>>>>
>>>> there was no mdadm.conf file, so I had to specify all devices and do a
>>>> --force
>>>>
>>>>
>>>> # cat /proc/mdstat
>>>> Personalities : [raid6] [raid5] [raid4]
>>>> md0 : active raid5 sdb1[0] sdc1[3] sdd1[1]
>>>>      4395407808 blocks level 5, 64k chunk, algorithm 2 [4/3] [UU_U]
>>>>
>>>> unused devices: <none>
>>>>
>>>> md0 is up :-)
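The "[4/3] [UU_U]" part of the mdstat output can also be checked by a script, e.g. for a cron job that warns when the array runs degraded. A minimal sketch in Python, assuming the status line keeps the format shown above and looking only at the first array it finds; missing_slots is a hypothetical helper name:

```python
import re

# Matches the status part of an mdstat line, e.g. "[4/3] [UU_U]".
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def missing_slots(mdstat_text):
    """Return (expected, active, [indexes of missing slots]), or None if no array status is found."""
    match = STATUS_RE.search(mdstat_text)
    if match is None:
        return None
    expected = int(match.group(1))
    active = int(match.group(2))
    flags = match.group(3)
    # "U" marks a working member, "_" marks a missing one.
    return expected, active, [i for i, flag in enumerate(flags) if flag == "_"]


if __name__ == "__main__":
    sample = (
        "Personalities : [raid6] [raid5] [raid4]\n"
        "md0 : active raid5 sdb1[0] sdc1[3] sdd1[1]\n"
        "     4395407808 blocks level 5, 64k chunk, algorithm 2 [4/3] [UU_U]\n"
        "\n"
        "unused devices: <none>\n"
    )
    print(missing_slots(sample))
```

For the output above the missing slot is index 2, which matches RaidDevice 2 (sda1) in the mdadm -E listing.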
>>>>
>>>> I'm about to start backing up the most important data; when this is
>>>> done I assume the proper way to get back to normal again is:
>>>>
>>>> - remove the bad drive from the array: mdadm /dev/md0 -r /dev/sda1
>>>> - physically replace sda with a new drive
>>>> - add it back: mdadm /dev/md0 -a /dev/sda1
>>>> - wait three days for the sync to complete (and keep fingers crossed
>>>> that no other drive fails)
>>>>
>>>> big tnx!
>>>>
>>>>
>>>> MB> sda1 was the only affected member of the array so you should be able
>>>> MB> to force-assemble the raid5 array and run it in degraded mode.
>>>>
>>>> MB> mdadm -Af /dev/md0
>>>> MB> If that doesn't work for any reason, do this:
>>>> MB> mdadm -Af /dev/md0 /dev/sdb1 /dev/sdd1 /dev/sdc1
>>>>
>>>> MB> You can note the disk order from the output of mdadm -E
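Reading the order and the stale member out of the mdadm -E output can be sketched in Python as below. This assumes the 0.90 superblock layout shown in this thread (an "Events :" line and a "this ..." row per device) and raw, unquoted mdadm output; parse_examine, stale_members and assembly_order are made-up helper names, not part of mdadm.

```python
import re


def parse_examine(text):
    """Map each device to its event count and RAID slot from `mdadm -E` text."""
    members = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = re.match(r"^(/dev/\S+):$", line)
        if header:
            current = header.group(1)
            members[current] = {}
        elif current and line.startswith("Events :"):
            members[current]["events"] = int(line.split(":", 1)[1])
        elif current and line.startswith("this"):
            # Columns: this, Number, Major, Minor, RaidDevice, State...
            members[current]["slot"] = int(line.split()[4])
    return members


def stale_members(members):
    """Devices whose event counter lags behind the newest one."""
    newest = max(m["events"] for m in members.values())
    return sorted(dev for dev, m in members.items() if m["events"] < newest)


def assembly_order(members):
    """Devices sorted by their RaidDevice slot."""
    return [dev for dev, m in sorted(members.items(), key=lambda kv: kv[1]["slot"])]


SAMPLE = """\
/dev/sda1:
         Events : 119530
this     2       8        1        2      active sync   /dev/sda1
/dev/sdb1:
         Events : 130037
this     0       8       17        0      active sync   /dev/sdb1
/dev/sdc1:
         Events : 130037
this     3       8       33        3      active sync   /dev/sdc1
/dev/sdd1:
         Events : 130037
this     1       8       49        1      active sync   /dev/sdd1
"""

if __name__ == "__main__":
    members = parse_examine(SAMPLE)
    print("stale:", stale_members(members))
    print("order:", assembly_order(members))
```

With the values from this thread, sda1 is the only member whose event counter (119530) is behind the others (130037), which fits the "kicking non-fresh sda1" kernel message.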
>>>>
>>>> MB> On Wed, Dec 23, 2009 at 5:02 PM, Rainer Fuegenstein
>>>> MB> <rfu@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> MB> My bad, run this: mdadm -E /dev/sd[a-z]1
>>>>>> should have figured this out myself (sorry; currently running in
>>>>>> panic mode ;-) )
>>>>>>
>>>>>> MB> 1 is the partition which most likely you added to the array rather
>>>>>> MB> than the whole disk (which is normal).
>>>>>>
>>>>>> # mdadm -E /dev/sd[a-z]1
>>>>>> /dev/sda1:
>>>>>>          Magic : a92b4efc
>>>>>>        Version : 0.90.00
>>>>>>           UUID : 81833582:d651e953:48cc5797:38b256ea
>>>>>>  Creation Time : Mon Mar 31 13:30:45 2008
>>>>>>     Raid Level : raid5
>>>>>>  Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
>>>>>>     Array Size : 4395407808 (4191.79 GiB 4500.90 GB)
>>>>>>   Raid Devices : 4
>>>>>>  Total Devices : 4
>>>>>> Preferred Minor : 0
>>>>>>
>>>>>>    Update Time : Wed Dec 23 02:54:49 2009
>>>>>>          State : clean
>>>>>>  Active Devices : 4
>>>>>> Working Devices : 4
>>>>>>  Failed Devices : 0
>>>>>>  Spare Devices : 0
>>>>>>       Checksum : 6cfa3a64 - correct
>>>>>>         Events : 119530
>>>>>>
>>>>>>         Layout : left-symmetric
>>>>>>     Chunk Size : 64K
>>>>>>
>>>>>>      Number   Major   Minor   RaidDevice State
>>>>>> this     2       8        1        2      active sync   /dev/sda1
>>>>>>
>>>>>>   0     0       8       17        0      active sync   /dev/sdb1
>>>>>>   1     1       8       49        1      active sync   /dev/sdd1
>>>>>>   2     2       8        1        2      active sync   /dev/sda1
>>>>>>   3     3       8       33        3      active sync   /dev/sdc1
>>>>>> /dev/sdb1:
>>>>>>          Magic : a92b4efc
>>>>>>        Version : 0.90.00
>>>>>>           UUID : 81833582:d651e953:48cc5797:38b256ea
>>>>>>  Creation Time : Mon Mar 31 13:30:45 2008
>>>>>>     Raid Level : raid5
>>>>>>  Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
>>>>>>     Array Size : 4395407808 (4191.79 GiB 4500.90 GB)
>>>>>>   Raid Devices : 4
>>>>>>  Total Devices : 4
>>>>>> Preferred Minor : 0
>>>>>>
>>>>>>    Update Time : Wed Dec 23 10:07:42 2009
>>>>>>          State : active
>>>>>>  Active Devices : 3
>>>>>> Working Devices : 3
>>>>>>  Failed Devices : 1
>>>>>>  Spare Devices : 0
>>>>>>       Checksum : 6cf8f610 - correct
>>>>>>         Events : 130037
>>>>>>
>>>>>>         Layout : left-symmetric
>>>>>>     Chunk Size : 64K
>>>>>>
>>>>>>      Number   Major   Minor   RaidDevice State
>>>>>> this     0       8       17        0      active sync   /dev/sdb1
>>>>>>
>>>>>>   0     0       8       17        0      active sync   /dev/sdb1
>>>>>>   1     1       8       49        1      active sync   /dev/sdd1
>>>>>>   2     2       0        0        2      faulty removed
>>>>>>   3     3       8       33        3      active sync   /dev/sdc1
>>>>>> /dev/sdc1:
>>>>>>          Magic : a92b4efc
>>>>>>        Version : 0.90.00
>>>>>>           UUID : 81833582:d651e953:48cc5797:38b256ea
>>>>>>  Creation Time : Mon Mar 31 13:30:45 2008
>>>>>>     Raid Level : raid5
>>>>>>  Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
>>>>>>     Array Size : 4395407808 (4191.79 GiB 4500.90 GB)
>>>>>>   Raid Devices : 4
>>>>>>  Total Devices : 4
>>>>>> Preferred Minor : 0
>>>>>>
>>>>>>    Update Time : Wed Dec 23 10:07:42 2009
>>>>>>          State : active
>>>>>>  Active Devices : 3
>>>>>> Working Devices : 3
>>>>>>  Failed Devices : 1
>>>>>>  Spare Devices : 0
>>>>>>       Checksum : 6cf8f626 - correct
>>>>>>         Events : 130037
>>>>>>
>>>>>>         Layout : left-symmetric
>>>>>>     Chunk Size : 64K
>>>>>>
>>>>>>      Number   Major   Minor   RaidDevice State
>>>>>> this     3       8       33        3      active sync   /dev/sdc1
>>>>>>
>>>>>>   0     0       8       17        0      active sync   /dev/sdb1
>>>>>>   1     1       8       49        1      active sync   /dev/sdd1
>>>>>>   2     2       0        0        2      faulty removed
>>>>>>   3     3       8       33        3      active sync   /dev/sdc1
>>>>>> /dev/sdd1:
>>>>>>          Magic : a92b4efc
>>>>>>        Version : 0.90.00
>>>>>>           UUID : 81833582:d651e953:48cc5797:38b256ea
>>>>>>  Creation Time : Mon Mar 31 13:30:45 2008
>>>>>>     Raid Level : raid5
>>>>>>  Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
>>>>>>     Array Size : 4395407808 (4191.79 GiB 4500.90 GB)
>>>>>>   Raid Devices : 4
>>>>>>  Total Devices : 4
>>>>>> Preferred Minor : 0
>>>>>>
>>>>>>    Update Time : Wed Dec 23 10:07:42 2009
>>>>>>          State : active
>>>>>>  Active Devices : 3
>>>>>> Working Devices : 3
>>>>>>  Failed Devices : 1
>>>>>>  Spare Devices : 0
>>>>>>       Checksum : 6cf8f632 - correct
>>>>>>         Events : 130037
>>>>>>
>>>>>>         Layout : left-symmetric
>>>>>>     Chunk Size : 64K
>>>>>>
>>>>>>      Number   Major   Minor   RaidDevice State
>>>>>> this     1       8       49        1      active sync   /dev/sdd1
>>>>>>
>>>>>>   0     0       8       17        0      active sync   /dev/sdb1
>>>>>>   1     1       8       49        1      active sync   /dev/sdd1
>>>>>>   2     2       0        0        2      faulty removed
>>>>>>   3     3       8       33        3      active sync   /dev/sdc1
>>>>>> [root@alfred log]#
>>>>>>
>>>>>> MB> You've included the smart report of one disk only. I suggest you look
>>>>>> MB> at the other disks as well and make sure that they're not reporting
>>>>>> MB> any errors. Also, keep in mind that you should run SMART tests
>>>>>> MB> periodically (this can be configured) and that if you haven't run any
>>>>>> MB> test before, you have to run a long or offline test before you can be
>>>>>> MB> sure that you don't have bad sectors.
>>>>>>
>>>>>> tnx for the hint, will do that as soon as I got my data back (if ever
>>>>>> ...)
>>>>>>
>>>>>>
>>>>>> MB> On Wed, Dec 23, 2009 at 4:44 PM, Rainer Fuegenstein
>>>>>> MB> <rfu@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> MB> Give the output of these:
>>>>>>>> MB> mdadm -E /dev/sd[a-z]
>>>>>>>>
>>>>>>>> ]# mdadm -E /dev/sd[a-z]
>>>>>>>> mdadm: No md superblock detected on /dev/sda.
>>>>>>>> mdadm: No md superblock detected on /dev/sdb.
>>>>>>>> mdadm: No md superblock detected on /dev/sdc.
>>>>>>>> mdadm: No md superblock detected on /dev/sdd.
>>>>>>>>
>>>>>>>> I assume that's not a good sign ?!
>>>>>>>>
>>>>>>>> sda was powered on and running after the reboot, a smartctl short test
>>>>>>>> revealed no errors and smartctl -a also looks unsuspicious (see
>>>>>>>> below). the drives are rather new.
>>>>>>>>
>>>>>>>> guess it's more likely to be either a problem of the power supply
>>>>>>>> (400W) or communication between controller and disk.
>>>>>>>>
>>>>>>>> /dev/sdd (before it was replaced) reported the following:
>>>>>>>>
>>>>>>>> Dec 20 07:18:54 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
>>>>>>>> Dec 20 07:48:53 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
>>>>>>>> Dec 20 08:18:54 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
>>>>>>>> Dec 20 08:48:55 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
>>>>>>>> Dec 20 09:18:53 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
>>>>>>>> Dec 20 09:48:58 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
>>>>>>>> Dec 20 10:19:01 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
>>>>>>>> Dec 20 10:48:54 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
>>>>>>>>
>>>>>>>> (which triggered a re-sync of the array)
>>>>>>>>
>>>>>>>>
>>>>>>>> # smartctl -a /dev/sda
>>>>>>>> smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
>>>>>>>> Home page is http://smartmontools.sourceforge.net/
>>>>>>>>
>>>>>>>> === START OF INFORMATION SECTION ===
>>>>>>>> Device Model:     WDC WD15EADS-00R6B0
>>>>>>>> Serial Number:    WD-WCAUP0017818
>>>>>>>> Firmware Version: 01.00A01
>>>>>>>> User Capacity:    1,500,301,910,016 bytes
>>>>>>>> Device is:        Not in smartctl database [for details use: -P showall]
>>>>>>>> ATA Version is:   8
>>>>>>>> ATA Standard is:  Exact ATA specification draft version not indicated
>>>>>>>> Local Time is:    Wed Dec 23 14:40:46 2009 CET
>>>>>>>> SMART support is: Available - device has SMART capability.
>>>>>>>> SMART support is: Enabled
>>>>>>>>
>>>>>>>> === START OF READ SMART DATA SECTION ===
>>>>>>>> SMART overall-health self-assessment test result: PASSED
>>>>>>>>
>>>>>>>> General SMART Values:
>>>>>>>> Offline data collection status:  (0x82) Offline data collection activity
>>>>>>>>                                        was completed without error.
>>>>>>>>                                        Auto Offline Data Collection: Enabled.
>>>>>>>> Self-test execution status:      (   0) The previous self-test routine completed
>>>>>>>>                                        without error or no self-test has ever
>>>>>>>>                                        been run.
>>>>>>>> Total time to complete Offline
>>>>>>>> data collection:                 (40800) seconds.
>>>>>>>> Offline data collection
>>>>>>>> capabilities:                    (0x7b) SMART execute Offline immediate.
>>>>>>>>                                        Auto Offline data collection on/off support.
>>>>>>>>                                        Suspend Offline collection upon new
>>>>>>>>                                        command.
>>>>>>>>                                        Offline surface scan supported.
>>>>>>>>                                        Self-test supported.
>>>>>>>>                                        Conveyance Self-test supported.
>>>>>>>>                                        Selective Self-test supported.
>>>>>>>> SMART capabilities:            (0x0003) Saves SMART data before entering
>>>>>>>>                                        power-saving mode.
>>>>>>>>                                        Supports SMART auto save timer.
>>>>>>>> Error logging capability:        (0x01) Error logging supported.
>>>>>>>>                                        General Purpose Logging supported.
>>>>>>>> Short self-test routine
>>>>>>>> recommended polling time:        (   2) minutes.
>>>>>>>> Extended self-test routine
>>>>>>>> recommended polling time:        ( 255) minutes.
>>>>>>>> Conveyance self-test routine
>>>>>>>> recommended polling time:        (   5) minutes.
>>>>>>>> SCT capabilities:              (0x303f) SCT Status supported.
>>>>>>>>                                        SCT Feature Control supported.
>>>>>>>>                                        SCT Data Table supported.
>>>>>>>>
>>>>>>>> SMART Attributes Data Structure revision number: 16
>>>>>>>> Vendor Specific SMART Attributes with Thresholds:
>>>>>>>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>>>>>>>  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
>>>>>>>>  3 Spin_Up_Time            0x0027   177   145   021    Pre-fail  Always       -       8133
>>>>>>>>  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       15
>>>>>>>>  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>>>>>>>>  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
>>>>>>>>  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       5272
>>>>>>>>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>>>>>>>>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>>>>>>>>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       14
>>>>>>>> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       2
>>>>>>>> 193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       13
>>>>>>>> 194 Temperature_Celsius     0x0022   125   109   000    Old_age   Always       -       27
>>>>>>>> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
>>>>>>>> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
>>>>>>>> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
>>>>>>>> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
>>>>>>>> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
>>>>>>>>
>>>>>>>> SMART Error Log Version: 1
>>>>>>>> No Errors Logged
>>>>>>>>
>>>>>>>> SMART Self-test log structure revision number 1
>>>>>>>> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
>>>>>>>> # 1  Short offline       Completed without error       00%      5272         -
>>>>>>>>
>>>>>>>> SMART Selective self-test log data structure revision number 1
>>>>>>>>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>>>>>>>>    1        0        0  Not_testing
>>>>>>>>    2        0        0  Not_testing
>>>>>>>>    3        0        0  Not_testing
>>>>>>>>    4        0        0  Not_testing
>>>>>>>>    5        0        0  Not_testing
>>>>>>>> Selective self-test flags (0x0):
>>>>>>>>  After scanning selected spans, do NOT read-scan remainder of disk.
>>>>>>>> If Selective self-test is pending on power-up, resume after 0 minute delay.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> MB> From the errors you show, it seems like one of the disks is dead (sda)
>>>>>>>> MB> or dying. It could be just a bad PCB (the controller board of the
>>>>>>>> MB> disk) as it refuses to return SMART data, so you might be able to
>>>>>>>> MB> rescue data by changing the PCB, if it's that important to have that
>>>>>>>> MB> disk.
>>>>>>>>
>>>>>>>> MB> As for the array, you can run a degraded array by force assembling it:
>>>>>>>> MB> mdadm -Af /dev/md0
>>>>>>>> MB> In the command above, mdadm will search the existing disks and
>>>>>>>> MB> partitions for members of an array and assemble that array,
>>>>>>>> MB> if possible.
>>>>>>>>
>>>>>>>> MB> I also suggest you install smartmontools package and run smartctl -a
>>>>>>>> MB> /dev/sd[a-z] and see the report for each disk to make sure you don't
>>>>>>>> MB> have bad sectors or bad cables (CRC/ATA read errors) on any of the
>>>>>>>> MB> disks.
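Checking every disk's report by eye gets tedious, so the attribute table can be filtered for the counters that hint at bad sectors or bad cables. A minimal sketch in Python, assuming the ten-column attribute table printed by smartctl -a as in the report above; the choice of watched attributes and the name nonzero_watched are my own, and raw values in a non-integer format would need extra handling:

```python
# Attributes that usually point to bad sectors or cabling problems
# (an illustrative selection, not an official list).
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}


def nonzero_watched(smartctl_text):
    """Return {attribute name: raw value} for watched attributes with a non-zero raw value."""
    problems = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns: ID, name, flag, value, worst,
        # thresh, type, updated, when_failed, raw value.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            raw = int(fields[9])
            if raw != 0:
                problems[fields[1]] = raw
    return problems


if __name__ == "__main__":
    # Two rows copied from the sda report in this thread.
    sample = (
        "  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0\n"
        "199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0\n"
    )
    print(nonzero_watched(sample))
```

For the sda report in this thread nothing is flagged, which again points away from the disk surface and towards the port, cable or power supply.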
>>>>>>>>
>>>>>>>> MB> On Wed, Dec 23, 2009 at 3:50 PM, Rainer Fuegenstein
>>>>>>>> MB> <rfu@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>> addendum: when going through the logs I found the reason:
>>>>>>>>>>
>>>>>>>>>> Dec 23 02:55:40 alfred kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>>>>>>>>>> Dec 23 02:55:40 alfred kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>>>>>>>>>> Dec 23 02:55:40 alfred kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>>>>>>>>>> Dec 23 02:55:40 alfred kernel: ata1.00: status: { DRDY }
>>>>>>>>>> Dec 23 02:55:45 alfred kernel: ata1: link is slow to respond, please be patient (ready=0)
>>>>>>>>>> Dec 23 02:55:50 alfred kernel: ata1: device not ready (errno=-16), forcing hardreset
>>>>>>>>>> Dec 23 02:55:50 alfred kernel: ata1: soft resetting link
>>>>>>>>>> Dec 23 02:55:55 alfred kernel: ata1: link is slow to respond, please be patient (ready=0)
>>>>>>>>>> Dec 23 02:56:00 alfred kernel: ata1: SRST failed (errno=-16)
>>>>>>>>>> Dec 23 02:56:00 alfred kernel: ata1: soft resetting link
>>>>>>>>>> Dec 23 02:56:05 alfred kernel: ata1: link is slow to respond, please be patient (ready=0)
>>>>>>>>>> Dec 23 02:56:10 alfred kernel: ata1: SRST failed (errno=-16)
>>>>>>>>>> Dec 23 02:56:10 alfred kernel: ata1: soft resetting link
>>>>>>>>>> Dec 23 02:56:15 alfred kernel: ata1: link is slow to respond, please be patient (ready=0)
>>>>>>>>>> Dec 23 02:56:45 alfred kernel: ata1: SRST failed (errno=-16)
>>>>>>>>>> Dec 23 02:56:45 alfred kernel: ata1: limiting SATA link speed to 1.5 Gbps
>>>>>>>>>> Dec 23 02:56:45 alfred kernel: ata1: soft resetting link
>>>>>>>>>> Dec 23 02:56:50 alfred kernel: ata1: SRST failed (errno=-16)
>>>>>>>>>> Dec 23 02:56:50 alfred kernel: ata1: reset failed, giving up
>>>>>>>>>> Dec 23 02:56:50 alfred kernel: ata1.00: disabled
>>>>>>>>>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: timing out command, waited 30s
>>>>>>>>>> Dec 23 02:56:50 alfred kernel: ata1: EH complete
>>>>>>>>>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: SCSI error: return code = 0x00040000
>>>>>>>>>> Dec 23 02:56:50 alfred kernel: end_request: I/O error, dev sda, sector 1244700223
>>>>>>>>>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: SCSI error: return code = 0x00040000
>>>>>>>>>> Dec 23 02:56:50 alfred kernel: end_request: I/O error, dev sda, sector 1554309191
>>>>>>>>>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: SCSI error: return code = 0x00040000
>>>>>>>>>> Dec 23 02:56:50 alfred kernel: end_request: I/O error, dev sda, sector 1554309439
>>>>>>>>>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: SCSI error: return code = 0x00040000
>>>>>>>>>> Dec 23 02:56:50 alfred kernel: end_request: I/O error, dev sda, sector 572721343
>>>>>>>>>> Dec 23 02:56:50 alfred kernel: raid5: Disk failure on sda1, disabling device. Operation continuing on 3 devices
>>>>>>>>>> Dec 23 02:56:50 alfred kernel: RAID5 conf printout:
>>>>>>>>>> Dec 23 02:56:50 alfred kernel:  --- rd:4 wd:3 fd:1
>>>>>>>>>> Dec 23 02:56:50 alfred kernel:  disk 0, o:1, dev:sdb1
>>>>>>>>>> Dec 23 02:56:50 alfred kernel:  disk 1, o:1, dev:sdd1
>>>>>>>>>> Dec 23 02:56:50 alfred kernel:  disk 2, o:0, dev:sda1
>>>>>>>>>> Dec 23 02:56:50 alfred kernel:  disk 3, o:1, dev:sdc1
>>>>>>>>>> Dec 23 02:56:50 alfred kernel: RAID5 conf printout:
>>>>>>>>>> Dec 23 02:56:50 alfred kernel:  --- rd:4 wd:3 fd:1
>>>>>>>>>> Dec 23 02:56:50 alfred kernel:  disk 0, o:1, dev:sdb1
>>>>>>>>>> Dec 23 02:56:50 alfred kernel:  disk 1, o:1, dev:sdd1
>>>>>>>>>> Dec 23 02:56:50 alfred kernel:  disk 3, o:1, dev:sdc1
>>>>>>>>>> Dec 23 03:22:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check
>>>>>>>>>> Dec 23 03:22:57 alfred smartd[2692]: Sending warning via mail to root ...
>>>>>>>>>> Dec 23 03:22:58 alfred smartd[2692]: Warning via mail to root: successful
>>>>>>>>>> Dec 23 03:22:58 alfred smartd[2692]: Device: /dev/sda, failed to read SMART Attribute Data
>>>>>>>>>> Dec 23 03:22:58 alfred smartd[2692]: Sending warning via mail to root ...
>>>>>>>>>> Dec 23 03:22:58 alfred smartd[2692]: Warning via mail to root: successful
>>>>>>>>>> Dec 23 03:52:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check
>>>>>>>>>> Dec 23 03:52:57 alfred smartd[2692]: Device: /dev/sda, failed to read SMART Attribute Data
>>>>>>>>>> Dec 23 04:22:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check
>>>>>>>>>> Dec 23 04:22:57 alfred smartd[2692]: Device: /dev/sda, failed to read SMART Attribute Data
>>>>>>>>>> Dec 23 04:52:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check
>>>>>>>>>>  [...]
>>>>>>>>>> Dec 23 09:52:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check
>>>>>>>>>> Dec 23 09:52:57 alfred smartd[2692]: Device: /dev/sda, failed to read SMART Attribute Data
>>>>>>>>>>  (crash here)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> RF> hi,
>>>>>>>>>>
>>>>>>>>>> RF> got a "nice" early christmas present this morning: after a crash, the raid5
>>>>>>>>>> RF> (consisting of 4*1.5TB WD caviar green SATA disks) won't start :-(
>>>>>>>>>>
>>>>>>>>>> RF> the history:
>>>>>>>>>> RF> sometimes, the raid kicked out one disk, started a resync (which
>>>>>>>>>> RF> lasted for about 3 days) and was fine after that. a few days ago I
>>>>>>>>>> RF> replaced drive sdd (which seemed to cause the troubles) and synced the
>>>>>>>>>> RF> raid again which finished yesterday in the early afternoon. at 10am
>>>>>>>>>> RF> today the system crashed and the raid won't start:
>>>>>>>>>>
>>>>>>>>>> RF> OS is Centos 5
>>>>>>>>>> RF> mdadm - v2.6.9 - 10th March 2009
>>>>>>>>>> RF> Linux alfred 2.6.18-164.6.1.el5xen #1 SMP Tue Nov 3 17:53:47 EST 2009 i686 athlon i386 GNU/Linux
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: Autodetecting RAID arrays.
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: autorun ...
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: considering sdd1 ...
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md:  adding sdd1 ...
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md:  adding sdc1 ...
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md:  adding sdb1 ...
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md:  adding sda1 ...
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: created md0
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: bind<sda1>
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: bind<sdb1>
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: bind<sdc1>
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: bind<sdd1>
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: running: <sdd1><sdc1><sdb1><sda1>
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: kicking non-fresh sda1 from array!
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: unbind<sda1>
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: export_rdev(sda1)
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: md0: raid array is not clean -- starting background reconstruction
>>>>>>>>>> RF>     (no reconstruction is actually started, disks are idle)
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid5: automatically using best checksumming function: pIII_sse
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel:    pIII_sse  :  7085.000 MB/sec
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid5: using function: pIII_sse (7085.000 MB/sec)
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: int32x1    896 MB/s
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: int32x2    972 MB/s
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: int32x4    893 MB/s
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: int32x8    934 MB/s
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: mmxx1     1845 MB/s
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: mmxx2     3250 MB/s
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: sse1x1    1799 MB/s
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: sse1x2    3067 MB/s
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: sse2x1    2980 MB/s
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: sse2x2    4015 MB/s
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: using algorithm sse2x2 (4015 MB/s)
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: raid6 personality registered for level 6
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: raid5 personality registered for level 5
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: raid4 personality registered for level 4
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid5: device sdd1 operational as raid disk 1
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid5: device sdc1 operational as raid disk 3
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid5: device sdb1 operational as raid disk 0
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid5: cannot start dirty degraded array for md0
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: RAID5 conf printout:
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel:  --- rd:4 wd:3 fd:1
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel:  disk 0, o:1, dev:sdb1
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel:  disk 1, o:1, dev:sdd1
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel:  disk 3, o:1, dev:sdc1
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid5: failed to run raid set md0
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: pers->run() failed ...
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: do_md_run() returned -5
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: md0 stopped.
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: unbind<sdd1>
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: export_rdev(sdd1)
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: unbind<sdc1>
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: export_rdev(sdc1)
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: unbind<sdb1>
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: export_rdev(sdb1)
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: ... autorun DONE.
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: device-mapper: multipath: version 1.0.5 loaded
>>>>>>>>>>
>>>>>>>>>> RF> # cat /proc/mdstat
>>>>>>>>>> RF> Personalities : [raid6] [raid5] [raid4]
>>>>>>>>>> RF> unused devices: <none>
>>>>>>>>>>
>>>>>>>>>> RF> filesystem used on top of md0 is xfs.
>>>>>>>>>>
>>>>>>>>>> RF> please advise what to do next and let me know if you need further
>>>>>>>>>> RF> information. really don't want to lose 3TB worth of data :-(
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> RF> tnx in advance.
>>>>>>>>>>
>>>>>>>>>> RF> --
>>>>>>>>>> RF> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>>>>>>>>> RF> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>>>> RF> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>> Unix gives you just enough rope to hang yourself -- and then a couple of more
>>>>>>>>>> feet, just to be sure.
>>>>>>>>>> (Eric Allman)
>>>>>>>>>> ------------------------------------------------------------------------------

------------------------------------------------------------------------------
Unix gives you just enough rope to hang yourself -- and then a couple of more 
feet, just to be sure.
(Eric Allman)
------------------------------------------------------------------------------
