Re: raid5: cannot start dirty degraded array

On Wed December 23 2009, Justin Piszcz wrote:
> Is anyone using (WD) 1.5TB (as noted below) successfully in an array
> without these errors?

I seem to recall SMART making my 2TB Greens flip out if used too much, but
I'm not sure whether that was down to the controller or something else.

> On Wed, 23 Dec 2009, Rainer Fuegenstein wrote:
> > MB> Is the disk being kicked always on the same port? (port 1 for
> > MB> example)
> >
> > not sure how to interpret the syslog messages:
> >
> > Nov 28 21:24:40 alfred kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> > Nov 28 21:24:40 alfred kernel: ata2.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0
> > Nov 28 21:24:40 alfred kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> > Nov 28 21:24:40 alfred kernel: ata2.00: status: { DRDY }
> > Nov 28 21:24:40 alfred kernel: ata2: soft resetting link
> > Nov 28 21:24:41 alfred kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> > Nov 28 21:24:41 alfred kernel: ata2.00: configured for UDMA/133
> > Nov 28 21:24:41 alfred kernel: ata2: EH complete
> > Nov 28 21:24:41 alfred kernel: SCSI device sdb: 2930277168 512-byte hdwr sectors (1500302 MB)
> > Nov 28 21:24:41 alfred kernel: sdb: Write Protect is off
> > Nov 28 21:24:41 alfred kernel: SCSI device sdb: drive cache: write back
> > Nov 28 21:24:41 alfred smartd[2770]: Device: /dev/sdd, 1 Offline uncorrectable sectors
> >
> > the smartd message for sdd appears frequently, that's why I replaced
> > the drive. the timeout above occurred 3 times within the last month for
> > sdb. guess you are right with either the port or the cable.
> >
> > tonight it was sda, but I might have disturbed the cable without
> > noticing when replacing sdd.
> >
> > MB> If so, then you may have a problem with that specific port. If it
> > MB> kicks disks randomly, and you're sure that your cables or disks are
> > MB> healthy, then it's probably time to change the motherboard.
> >
> > I plan to move to the new atom/pinetrail mainboards as soon as they
> > are available in January. hope that solves this issue. but will check
> > the cable anyway.
> >
> > tnx & cu
> >
> >
> > MB> Increasing the resync values of min will slow down your server if
> > MB> you're trying to access it during a resync.
> >
> > MB> On Wed, Dec 23, 2009 at 6:13 PM, Rainer Fuegenstein
> >
> > MB> <rfu@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> >>> MB> I don't know why your array takes 3 days to resync. My array is
> >>> MB> 7TB in size (8x1TB @ RAID5) and it takes about 16 hours.
> >>>
> >>> that's definitely a big mystery. I put this to this list some time
> >>> ago when upgrading the same array from 4*750GB to 4*1500GB by
> >>> replacing one disk after the other and finally --growing the raid:
> >>>
> >>> 1st disk took just a few minutes
> >>> 2nd disk some hours
> >>> 3rd disk more than a day
> >>> 4th disk about 2+ days
> >>> --grow also took  2+ days
> >>>
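
For scale, a resync reads or writes each member end-to-end once, so the expected time is just member size divided by sustained rate. A rough Python estimate (the 1465135936 KiB figure is the Used Dev Size from the mdadm -E output in this thread; the rates are illustrative assumptions, not measurements):

```python
# Back-of-envelope resync time: member size / sustained rate.
# 1465135936 is the "Used Dev Size" (in KiB) that mdadm -E reports
# for these members; the rates below are assumed, not measured.
member_bytes = 1465135936 * 1024

for rate_mb_s in (100, 50, 6):
    hours = member_bytes / (rate_mb_s * 1_000_000) / 3600
    print(f"{rate_mb_s:>3} MB/s -> {hours:6.1f} hours")
```

A 3-day resync of a 1.5TB member works out to roughly 6 MB/s, which points at throttling or a sick link rather than normal drive speed.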
> >>> MB> Check the value of this file:
> >>> MB> cat /proc/sys/dev/raid/speed_limit_max
> >>>
> >>> default values are:
> >>> [root@alfred cdrom]# cat /proc/sys/dev/raid/speed_limit_max
> >>> 200000
> >>> [root@alfred cdrom]# cat /proc/sys/dev/raid/speed_limit_min
> >>> 1000
> >>>
> >>> when resyncing (with these default values), the server becomes awfully
> >>> slow (streaming mp3 via smb suffers timeouts).
> >>>
> >>> mainboard is an Asus M2N with NFORCE-MCP61 chipset.
> >>>
> >>> this server started on an 800MHz asus board with 4*400 GB PATA disks
> >>> and had this one-disk-failure from the start (every few months). over
> >>> the years everything was replaced (power supply, mainboard, disks,
> >>> controller, pata to sata, ...) but it still kicks out disks (with the
> >>> current asus M2N board about every two to three weeks).
> >>>
> >>> must be cosmic radiation to blame ...
> >>>
> >>>
> >>> MB> Make it a high number so that when there's no process querying
> >>> MB> the disks, the resync process will go for the max speed:
> >>> MB> echo '200000' > /proc/sys/dev/raid/speed_limit_max
> >>> MB> (200 MB/s)
> >>>
> >>> MB> The file /proc/sys/dev/raid/speed_limit_min specifies the minimum
> >>> MB> speed at which the array should resync, even when there are other
> >>> MB> programs querying the disks.
> >>>
> >>> MB> Make sure you run the above changes just before you issue a
> >>> MB> resync. Changes are lost on reboot.
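
If the tuning should survive reboots, the same two knobs are also exposed as sysctls; a config fragment for /etc/sysctl.conf (values here are just the defaults shown earlier in the thread -- raise speed_limit_min if resyncs are being starved by other I/O):

```
# md resync throttles, in KB/s per device
dev.raid.speed_limit_max = 200000
dev.raid.speed_limit_min = 1000
```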
> >>>
> >>> MB> On Wed, Dec 23, 2009 at 5:30 PM, Rainer Fuegenstein
> >>>
> >>> MB> <rfu@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> >>>>> tnx for the info, in the meantime I did:
> >>>>>
> >>>>> mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1
> >>>>> /dev/sdd1
> >>>>>
> >>>>> there was no mdadm.conf file, so I had to specify all devices and
> >>>>> do a --force
> >>>>>
> >>>>>
> >>>>> # cat /proc/mdstat
> >>>>> Personalities : [raid6] [raid5] [raid4]
> >>>>> md0 : active raid5 sdb1[0] sdc1[3] sdd1[1]
> >>>>>      4395407808 blocks level 5, 64k chunk, algorithm 2 [4/3] [UU_U]
> >>>>>
> >>>>> unused devices: <none>
> >>>>>
> >>>>> md0 is up :-)
> >>>>>
> >>>>> I'm about to start backing up the most important data; when this is
> >>>>> done I assume the proper way to get back to normal again is:
> >>>>>
> >>>>> - remove the bad drive from the array: mdadm /dev/md0 -r /dev/sda1
> >>>>> - physically replace sda with a new drive
> >>>>> - add it back: mdadm /dev/md0 -a /dev/sda1
> >>>>> - wait three days for the sync to complete (and keep fingers
> >>>>> crossed that no other drive fails)
> >>>>>
> >>>>> big tnx!
> >>>>>
> >>>>>
> >>>>> MB> sda1 was the only affected member of the array so you should be
> >>>>> MB> able to force-assemble the raid5 array and run it in degraded
> >>>>> MB> mode.
> >>>>>
> >>>>> MB> mdadm -Af /dev/md0
> >>>>> MB> If that doesn't work for any reason, do this:
> >>>>> MB> mdadm -Af /dev/md0 /dev/sdb1 /dev/sdd1 /dev/sdc1
> >>>>>
> >>>>> MB> You can note the disk order from the output of mdadm -E
> >>>>>
> >>>>> MB> On Wed, Dec 23, 2009 at 5:02 PM, Rainer Fuegenstein
> >>>>>
> >>>>> MB> <rfu@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> >>>>>>> MB> My bad, run this: mdadm -E /dev/sd[a-z]1
> >>>>>>> should have figured this out myself (sorry; currently running in
> >>>>>>> panic mode ;-) )
> >>>>>>>
> >>>>>>> MB> 1 is the partition, which most likely you added to the array
> >>>>>>> MB> rather than the whole disk (which is normal).
> >>>>>>>
> >>>>>>> # mdadm -E /dev/sd[a-z]1
> >>>>>>> /dev/sda1:
> >>>>>>>          Magic : a92b4efc
> >>>>>>>        Version : 0.90.00
> >>>>>>>           UUID : 81833582:d651e953:48cc5797:38b256ea
> >>>>>>>  Creation Time : Mon Mar 31 13:30:45 2008
> >>>>>>>     Raid Level : raid5
> >>>>>>>  Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
> >>>>>>>     Array Size : 4395407808 (4191.79 GiB 4500.90 GB)
> >>>>>>>   Raid Devices : 4
> >>>>>>>  Total Devices : 4
> >>>>>>> Preferred Minor : 0
> >>>>>>>
> >>>>>>>    Update Time : Wed Dec 23 02:54:49 2009
> >>>>>>>          State : clean
> >>>>>>>  Active Devices : 4
> >>>>>>> Working Devices : 4
> >>>>>>>  Failed Devices : 0
> >>>>>>>  Spare Devices : 0
> >>>>>>>       Checksum : 6cfa3a64 - correct
> >>>>>>>         Events : 119530
> >>>>>>>
> >>>>>>>         Layout : left-symmetric
> >>>>>>>     Chunk Size : 64K
> >>>>>>>
> >>>>>>>      Number   Major   Minor   RaidDevice State
> >>>>>>> this     2       8        1        2      active sync   /dev/sda1
> >>>>>>>
> >>>>>>>   0     0       8       17        0      active sync   /dev/sdb1
> >>>>>>>   1     1       8       49        1      active sync   /dev/sdd1
> >>>>>>>   2     2       8        1        2      active sync   /dev/sda1
> >>>>>>>   3     3       8       33        3      active sync   /dev/sdc1
> >>>>>>> /dev/sdb1:
> >>>>>>>          Magic : a92b4efc
> >>>>>>>        Version : 0.90.00
> >>>>>>>           UUID : 81833582:d651e953:48cc5797:38b256ea
> >>>>>>>  Creation Time : Mon Mar 31 13:30:45 2008
> >>>>>>>     Raid Level : raid5
> >>>>>>>  Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
> >>>>>>>     Array Size : 4395407808 (4191.79 GiB 4500.90 GB)
> >>>>>>>   Raid Devices : 4
> >>>>>>>  Total Devices : 4
> >>>>>>> Preferred Minor : 0
> >>>>>>>
> >>>>>>>    Update Time : Wed Dec 23 10:07:42 2009
> >>>>>>>          State : active
> >>>>>>>  Active Devices : 3
> >>>>>>> Working Devices : 3
> >>>>>>>  Failed Devices : 1
> >>>>>>>  Spare Devices : 0
> >>>>>>>       Checksum : 6cf8f610 - correct
> >>>>>>>         Events : 130037
> >>>>>>>
> >>>>>>>         Layout : left-symmetric
> >>>>>>>     Chunk Size : 64K
> >>>>>>>
> >>>>>>>      Number   Major   Minor   RaidDevice State
> >>>>>>> this     0       8       17        0      active sync   /dev/sdb1
> >>>>>>>
> >>>>>>>   0     0       8       17        0      active sync   /dev/sdb1
> >>>>>>>   1     1       8       49        1      active sync   /dev/sdd1
> >>>>>>>   2     2       0        0        2      faulty removed
> >>>>>>>   3     3       8       33        3      active sync   /dev/sdc1
> >>>>>>> /dev/sdc1:
> >>>>>>>          Magic : a92b4efc
> >>>>>>>        Version : 0.90.00
> >>>>>>>           UUID : 81833582:d651e953:48cc5797:38b256ea
> >>>>>>>  Creation Time : Mon Mar 31 13:30:45 2008
> >>>>>>>     Raid Level : raid5
> >>>>>>>  Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
> >>>>>>>     Array Size : 4395407808 (4191.79 GiB 4500.90 GB)
> >>>>>>>   Raid Devices : 4
> >>>>>>>  Total Devices : 4
> >>>>>>> Preferred Minor : 0
> >>>>>>>
> >>>>>>>    Update Time : Wed Dec 23 10:07:42 2009
> >>>>>>>          State : active
> >>>>>>>  Active Devices : 3
> >>>>>>> Working Devices : 3
> >>>>>>>  Failed Devices : 1
> >>>>>>>  Spare Devices : 0
> >>>>>>>       Checksum : 6cf8f626 - correct
> >>>>>>>         Events : 130037
> >>>>>>>
> >>>>>>>         Layout : left-symmetric
> >>>>>>>     Chunk Size : 64K
> >>>>>>>
> >>>>>>>      Number   Major   Minor   RaidDevice State
> >>>>>>> this     3       8       33        3      active sync   /dev/sdc1
> >>>>>>>
> >>>>>>>   0     0       8       17        0      active sync   /dev/sdb1
> >>>>>>>   1     1       8       49        1      active sync   /dev/sdd1
> >>>>>>>   2     2       0        0        2      faulty removed
> >>>>>>>   3     3       8       33        3      active sync   /dev/sdc1
> >>>>>>> /dev/sdd1:
> >>>>>>>          Magic : a92b4efc
> >>>>>>>        Version : 0.90.00
> >>>>>>>           UUID : 81833582:d651e953:48cc5797:38b256ea
> >>>>>>>  Creation Time : Mon Mar 31 13:30:45 2008
> >>>>>>>     Raid Level : raid5
> >>>>>>>  Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
> >>>>>>>     Array Size : 4395407808 (4191.79 GiB 4500.90 GB)
> >>>>>>>   Raid Devices : 4
> >>>>>>>  Total Devices : 4
> >>>>>>> Preferred Minor : 0
> >>>>>>>
> >>>>>>>    Update Time : Wed Dec 23 10:07:42 2009
> >>>>>>>          State : active
> >>>>>>>  Active Devices : 3
> >>>>>>> Working Devices : 3
> >>>>>>>  Failed Devices : 1
> >>>>>>>  Spare Devices : 0
> >>>>>>>       Checksum : 6cf8f632 - correct
> >>>>>>>         Events : 130037
> >>>>>>>
> >>>>>>>         Layout : left-symmetric
> >>>>>>>     Chunk Size : 64K
> >>>>>>>
> >>>>>>>      Number   Major   Minor   RaidDevice State
> >>>>>>> this     1       8       49        1      active sync   /dev/sdd1
> >>>>>>>
> >>>>>>>   0     0       8       17        0      active sync   /dev/sdb1
> >>>>>>>   1     1       8       49        1      active sync   /dev/sdd1
> >>>>>>>   2     2       0        0        2      faulty removed
> >>>>>>>   3     3       8       33        3      active sync   /dev/sdc1
> >>>>>>> [root@alfred log]#
> >>>>>>>
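
One detail worth spelling out from the -E dump above: sda1's Events counter (119530) lags the other three members (130037), and its superblock still claims all four devices are active. That stale counter is exactly what "kicking non-fresh sda1 from array!" refers to later, and why --force is needed. The check md performs amounts to:

```python
# Events counters as scraped from the mdadm -E output above.
# A member whose counter is behind the rest holds stale metadata
# and gets rejected as "non-fresh" during auto-assembly.
events = {"sda1": 119530, "sdb1": 130037, "sdc1": 130037, "sdd1": 130037}

freshest = max(events.values())
stale = sorted(dev for dev, n in events.items() if n < freshest)
print("stale members:", stale)
```

`mdadm --assemble --force` succeeds because it updates the stale superblock to agree with the majority of members.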
> >>>>>>> MB> You've included the smart report of one disk only. I suggest
> >>>>>>> MB> you look at the other disks as well and make sure that they're
> >>>>>>> MB> not reporting any errors. Also, keep in mind that you should
> >>>>>>> MB> run smart tests periodically (can be configured) and that if
> >>>>>>> MB> you haven't run any test before, you have to run a long or
> >>>>>>> MB> offline test before making sure that you don't have bad
> >>>>>>> MB> sectors.
> >>>>>>>
> >>>>>>> tnx for the hint, will do that as soon as I've got my data back
> >>>>>>> (if ever ...)
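
On the periodic-testing advice: the attributes worth alarming on here are the reallocation/pending/CRC counters, since those are exactly what smartd kept flagging on /dev/sdd. A hypothetical helper (function name and structure made up, but the column layout matches smartmontools' `smartctl -A` table):

```python
# Extract the error-related SMART attributes from `smartctl -A` text.
# IDs watched: 5 (reallocated), 196 (realloc events), 197 (pending),
# 198 (offline uncorrectable), 199 (UDMA CRC errors - often a cable).
WATCH_IDS = {"5", "196", "197", "198", "199"}

def error_counts(smartctl_output: str) -> dict:
    counts = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns; RAW_VALUE is the last one.
        if len(fields) >= 10 and fields[0] in WATCH_IDS:
            counts[fields[1]] = int(fields[9])
    return counts
```

A non-zero Current_Pending_Sector or Offline_Uncorrectable is the condition that had smartd mailing root about /dev/sdd; a climbing UDMA_CRC_Error_Count usually means cable or port, not platters.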
> >>>>>>>
> >>>>>>>
> >>>>>>> MB> On Wed, Dec 23, 2009 at 4:44 PM, Rainer Fuegenstein
> >>>>>>>
> >>>>>>> MB> <rfu@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> >>>>>>>>> MB> Give the output of these:
> >>>>>>>>> MB> mdadm -E /dev/sd[a-z]
> >>>>>>>>>
> >>>>>>>>> ]# mdadm -E /dev/sd[a-z]
> >>>>>>>>> mdadm: No md superblock detected on /dev/sda.
> >>>>>>>>> mdadm: No md superblock detected on /dev/sdb.
> >>>>>>>>> mdadm: No md superblock detected on /dev/sdc.
> >>>>>>>>> mdadm: No md superblock detected on /dev/sdd.
> >>>>>>>>>
> >>>>>>>>> I assume that's not a good sign ?!
> >>>>>>>>>
> >>>>>>>>> sda was powered on and running after the reboot, a smartctl
> >>>>>>>>> short test revealed no errors and smartctl -a also looks
> >>>>>>>>> unsuspicious (see below). the drives are rather new.
> >>>>>>>>>
> >>>>>>>>> guess it's more likely to be either a problem of the power
> >>>>>>>>> supply (400W) or communication between controller and disk.
> >>>>>>>>>
> >>>>>>>>> /dev/sdd (before it was replaced) reported the following:
> >>>>>>>>>
> >>>>>>>>> Dec 20 07:18:54 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
> >>>>>>>>> Dec 20 07:48:53 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
> >>>>>>>>> Dec 20 08:18:54 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
> >>>>>>>>> Dec 20 08:48:55 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
> >>>>>>>>> Dec 20 09:18:53 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
> >>>>>>>>> Dec 20 09:48:58 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
> >>>>>>>>> Dec 20 10:19:01 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
> >>>>>>>>> Dec 20 10:48:54 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors
> >>>>>>>>>
> >>>>>>>>> (what triggered a re-sync of the array)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> # smartctl -a /dev/sda
> >>>>>>>>> smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
> >>>>>>>>> Home page is http://smartmontools.sourceforge.net/
> >>>>>>>>>
> >>>>>>>>> === START OF INFORMATION SECTION ===
> >>>>>>>>> Device Model:     WDC WD15EADS-00R6B0
> >>>>>>>>> Serial Number:    WD-WCAUP0017818
> >>>>>>>>> Firmware Version: 01.00A01
> >>>>>>>>> User Capacity:    1,500,301,910,016 bytes
> >>>>>>>>> Device is:        Not in smartctl database [for details use: -P showall]
> >>>>>>>>> ATA Version is:   8
> >>>>>>>>> ATA Standard is:  Exact ATA specification draft version not indicated
> >>>>>>>>> Local Time is:    Wed Dec 23 14:40:46 2009 CET
> >>>>>>>>> SMART support is: Available - device has SMART capability.
> >>>>>>>>> SMART support is: Enabled
> >>>>>>>>>
> >>>>>>>>> === START OF READ SMART DATA SECTION ===
> >>>>>>>>> SMART overall-health self-assessment test result: PASSED
> >>>>>>>>>
> >>>>>>>>> General SMART Values:
> >>>>>>>>> Offline data collection status:  (0x82) Offline data collection activity
> >>>>>>>>>                                         was completed without error.
> >>>>>>>>>                                         Auto Offline Data Collection: Enabled.
> >>>>>>>>> Self-test execution status:      (   0) The previous self-test routine completed
> >>>>>>>>>                                         without error or no self-test has ever
> >>>>>>>>>                                         been run.
> >>>>>>>>> Total time to complete Offline
> >>>>>>>>> data collection:                 (40800) seconds.
> >>>>>>>>> Offline data collection
> >>>>>>>>> capabilities:                    (0x7b) SMART execute Offline immediate.
> >>>>>>>>>                                         Auto Offline data collection on/off support.
> >>>>>>>>>                                         Suspend Offline collection upon new command.
> >>>>>>>>>                                         Offline surface scan supported.
> >>>>>>>>>                                         Self-test supported.
> >>>>>>>>>                                         Conveyance Self-test supported.
> >>>>>>>>>                                         Selective Self-test supported.
> >>>>>>>>> SMART capabilities:            (0x0003) Saves SMART data before entering
> >>>>>>>>>                                         power-saving mode.
> >>>>>>>>>                                         Supports SMART auto save timer.
> >>>>>>>>> Error logging capability:        (0x01) Error logging supported.
> >>>>>>>>>                                         General Purpose Logging supported.
> >>>>>>>>> Short self-test routine
> >>>>>>>>> recommended polling time:        (   2) minutes.
> >>>>>>>>> Extended self-test routine
> >>>>>>>>> recommended polling time:        ( 255) minutes.
> >>>>>>>>> Conveyance self-test routine
> >>>>>>>>> recommended polling time:        (   5) minutes.
> >>>>>>>>> SCT capabilities:              (0x303f) SCT Status supported.
> >>>>>>>>>                                         SCT Feature Control supported.
> >>>>>>>>>                                         SCT Data Table supported.
> >>>>>>>>>
> >>>>>>>>> SMART Attributes Data Structure revision number: 16
> >>>>>>>>> Vendor Specific SMART Attributes with Thresholds:
> >>>>>>>>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
> >>>>>>>>>   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
> >>>>>>>>>   3 Spin_Up_Time            0x0027   177   145   021    Pre-fail  Always       -       8133
> >>>>>>>>>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       15
> >>>>>>>>>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
> >>>>>>>>>   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
> >>>>>>>>>   9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       5272
> >>>>>>>>>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
> >>>>>>>>>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
> >>>>>>>>>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       14
> >>>>>>>>> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       2
> >>>>>>>>> 193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       13
> >>>>>>>>> 194 Temperature_Celsius     0x0022   125   109   000    Old_age   Always       -       27
> >>>>>>>>> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> >>>>>>>>> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> >>>>>>>>> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> >>>>>>>>> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> >>>>>>>>> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
> >>>>>>>>>
> >>>>>>>>> SMART Error Log Version: 1
> >>>>>>>>> No Errors Logged
> >>>>>>>>>
> >>>>>>>>> SMART Self-test log structure revision number 1
> >>>>>>>>> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> >>>>>>>>> # 1  Short offline       Completed without error       00%      5272         -
> >>>>>>>>>
> >>>>>>>>> SMART Selective self-test log data structure revision number 1
> >>>>>>>>>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
> >>>>>>>>>    1        0        0  Not_testing
> >>>>>>>>>    2        0        0  Not_testing
> >>>>>>>>>    3        0        0  Not_testing
> >>>>>>>>>    4        0        0  Not_testing
> >>>>>>>>>    5        0        0  Not_testing
> >>>>>>>>> Selective self-test flags (0x0):
> >>>>>>>>>   After scanning selected spans, do NOT read-scan remainder of disk.
> >>>>>>>>> If Selective self-test is pending on power-up, resume after 0 minute delay.
> >>>>>>>>>
> >>>>>>>>> MB> From the errors you show, it seems like one of the disks is
> >>>>>>>>> MB> dead (sda) or dying. It could be just a bad PCB (the
> >>>>>>>>> MB> controller board of the disk) as it refuses to return SMART
> >>>>>>>>> MB> data, so you might be able to rescue data by changing the
> >>>>>>>>> MB> PCB, if it's that important to have that disk.
> >>>>>>>>>
> >>>>>>>>> MB> As for the array, you can run a degraded array by force
> >>>>>>>>> MB> assembling it: mdadm -Af /dev/md0
> >>>>>>>>> MB> In the command above, mdadm will search on existing disks and
> >>>>>>>>> MB> partitions, which of them belongs to an array, and assemble
> >>>>>>>>> MB> that array, if possible.
> >>>>>>>>>
> >>>>>>>>> MB> I also suggest you install the smartmontools package and run
> >>>>>>>>> MB> smartctl -a /dev/sd[a-z] and see the report for each disk to
> >>>>>>>>> MB> make sure you don't have bad sectors or bad cables (CRC/ATA
> >>>>>>>>> MB> read errors) on any of the disks.
> >>>>>>>>>
> >>>>>>>>> MB> On Wed, Dec 23, 2009 at 3:50 PM, Rainer Fuegenstein
> >>>>>>>>>
> >>>>>>>>> MB> <rfu@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> >>>>>>>>>>> addendum: when going through the logs I found the reason:
> >>>>>>>>>>>
> >>>>>>>>>>> Dec 23 02:55:40 alfred kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> >>>>>>>>>>> Dec 23 02:55:40 alfred kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> >>>>>>>>>>> Dec 23 02:55:40 alfred kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> >>>>>>>>>>> Dec 23 02:55:40 alfred kernel: ata1.00: status: { DRDY }
> >>>>>>>>>>> Dec 23 02:55:45 alfred kernel: ata1: link is slow to respond, please be patient (ready=0)
> >>>>>>>>>>> Dec 23 02:55:50 alfred kernel: ata1: device not ready (errno=-16), forcing hardreset
> >>>>>>>>>>> Dec 23 02:55:50 alfred kernel: ata1: soft resetting link
> >>>>>>>>>>> Dec 23 02:55:55 alfred kernel: ata1: link is slow to respond, please be patient (ready=0)
> >>>>>>>>>>> Dec 23 02:56:00 alfred kernel: ata1: SRST failed (errno=-16)
> >>>>>>>>>>> Dec 23 02:56:00 alfred kernel: ata1: soft resetting link
> >>>>>>>>>>> Dec 23 02:56:05 alfred kernel: ata1: link is slow to respond, please be patient (ready=0)
> >>>>>>>>>>> Dec 23 02:56:10 alfred kernel: ata1: SRST failed (errno=-16)
> >>>>>>>>>>> Dec 23 02:56:10 alfred kernel: ata1: soft resetting link
> >>>>>>>>>>> Dec 23 02:56:15 alfred kernel: ata1: link is slow to respond, please be patient (ready=0)
> >>>>>>>>>>> Dec 23 02:56:45 alfred kernel: ata1: SRST failed (errno=-16)
> >>>>>>>>>>> Dec 23 02:56:45 alfred kernel: ata1: limiting SATA link speed to 1.5 Gbps
> >>>>>>>>>>> Dec 23 02:56:45 alfred kernel: ata1: soft resetting link
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel: ata1: SRST failed (errno=-16)
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel: ata1: reset failed, giving up
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel: ata1.00: disabled
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: timing out command, waited 30s
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel: ata1: EH complete
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: SCSI error: return code = 0x00040000
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel: end_request: I/O error, dev sda, sector 1244700223
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: SCSI error: return code = 0x00040000
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel: end_request: I/O error, dev sda, sector 1554309191
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: SCSI error: return code = 0x00040000
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel: end_request: I/O error, dev sda, sector 1554309439
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: SCSI error: return code = 0x00040000
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel: end_request: I/O error, dev sda, sector 572721343
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel: raid5: Disk failure on sda1, disabling device. Operation continuing on 3 devices
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel: RAID5 conf printout:
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel:  --- rd:4 wd:3 fd:1
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel:  disk 0, o:1, dev:sdb1
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel:  disk 1, o:1, dev:sdd1
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel:  disk 2, o:0, dev:sda1
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel:  disk 3, o:1, dev:sdc1
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel: RAID5 conf printout:
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel:  --- rd:4 wd:3 fd:1
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel:  disk 0, o:1, dev:sdb1
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel:  disk 1, o:1, dev:sdd1
> >>>>>>>>>>> Dec 23 02:56:50 alfred kernel:  disk 3, o:1, dev:sdc1
> >>>>>>>>>>> Dec 23 03:22:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check
> >>>>>>>>>>> Dec 23 03:22:57 alfred smartd[2692]: Sending warning via mail to root ...
> >>>>>>>>>>> Dec 23 03:22:58 alfred smartd[2692]: Warning via mail to root: successful
> >>>>>>>>>>> Dec 23 03:22:58 alfred smartd[2692]: Device: /dev/sda, failed to read SMART Attribute Data
> >>>>>>>>>>> Dec 23 03:22:58 alfred smartd[2692]: Sending warning via mail to root ...
> >>>>>>>>>>> Dec 23 03:22:58 alfred smartd[2692]: Warning via mail to root: successful
> >>>>>>>>>>> Dec 23 03:52:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check
> >>>>>>>>>>> Dec 23 03:52:57 alfred smartd[2692]: Device: /dev/sda, failed to read SMART Attribute Data
> >>>>>>>>>>> Dec 23 04:22:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check
> >>>>>>>>>>> Dec 23 04:22:57 alfred smartd[2692]: Device: /dev/sda, failed to read SMART Attribute Data
> >>>>>>>>>>> Dec 23 04:52:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check
> >>>>>>>>>>> [...]
> >>>>>>>>>>> Dec 23 09:52:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check
> >>>>>>>>>>> Dec 23 09:52:57 alfred smartd[2692]: Device: /dev/sda, failed to read SMART Attribute Data
> >>>>>>>>>>> (crash here)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> RF> hi,
> >>>>>>>>>>>
> >>>>>>>>>>> RF> got a "nice" early christmas present this morning: after a
> >>>>>>>>>>> RF> crash, the raid5 (consisting of 4*1.5TB WD caviar green SATA
> >>>>>>>>>>> RF> disks) won't start :-(
> >>>>>>>>>>>
> >>>>>>>>>>> RF> the history:
> >>>>>>>>>>> RF> sometimes, the raid kicked out one disk, started a resync
> >>>>>>>>>>> RF> (which lasted for about 3 days) and was fine after that. a
> >>>>>>>>>>> RF> few days ago I replaced drive sdd (which seemed to cause the
> >>>>>>>>>>> RF> troubles) and synced the raid again, which finished yesterday
> >>>>>>>>>>> RF> in the early afternoon. at 10am today the system crashed and
> >>>>>>>>>>> RF> the raid won't start:
> >>>>>>>>>>>
> >>>>>>>>>>> RF> OS is Centos 5
> >>>>>>>>>>> RF> mdadm - v2.6.9 - 10th March 2009
> >>>>>>>>>>> RF> Linux alfred 2.6.18-164.6.1.el5xen #1 SMP Tue Nov 3 17:53:47 EST 2009 i686 athlon i386 GNU/Linux
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: Autodetecting RAID arrays.
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: autorun ...
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: considering sdd1 ...
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md:  adding sdd1 ...
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md:  adding sdc1 ...
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md:  adding sdb1 ...
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md:  adding sda1 ...
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: created md0
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: bind<sda1>
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: bind<sdb1>
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: bind<sdc1>
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: bind<sdd1>
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: running: <sdd1><sdc1><sdb1><sda1>
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: kicking non-fresh sda1 from array!
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: unbind<sda1>
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: export_rdev(sda1)
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: md0: raid array is not clean -- starting background reconstruction
> >>>>>>>>>>> RF>     (no reconstruction is actually started, disks are idle)
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid5: automatically using best checksumming function: pIII_sse
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel:    pIII_sse  :  7085.000 MB/sec
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid5: using function: pIII_sse (7085.000 MB/sec)
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: int32x1    896 MB/s
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: int32x2    972 MB/s
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: int32x4    893 MB/s
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: int32x8    934 MB/s
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: mmxx1     1845 MB/s
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: mmxx2     3250 MB/s
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: sse1x1    1799 MB/s
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: sse1x2    3067 MB/s
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: sse2x1    2980 MB/s
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: sse2x2    4015 MB/s
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: using algorithm sse2x2 (4015 MB/s)
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: raid6 personality registered for level 6
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: raid5 personality registered for level 5
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: raid4 personality registered for level 4
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid5: device sdd1 operational as raid disk 1
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid5: device sdc1 operational as raid disk 3
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid5: device sdb1 operational as raid disk 0
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid5: cannot start dirty degraded array for md0
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: RAID5 conf printout:
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel:  --- rd:4 wd:3 fd:1
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel:  disk 0, o:1, dev:sdb1
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel:  disk 1, o:1, dev:sdd1
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel:  disk 3, o:1, dev:sdc1
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid5: failed to run raid set md0
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: pers->run() failed ...
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: do_md_run() returned -5
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: md0 stopped.
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: unbind<sdd1>
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: export_rdev(sdd1)
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: unbind<sdc1>
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: export_rdev(sdc1)
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: unbind<sdb1>
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: export_rdev(sdb1)
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: ... autorun DONE.
> >>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: device-mapper: multipath: version 1.0.5 loaded
> >>>>>>>>>>>
> >>>>>>>>>>> RF> # cat /proc/mdstat
> >>>>>>>>>>> RF> Personalities : [raid6] [raid5] [raid4]
> >>>>>>>>>>> RF> unused devices: <none>
> >>>>>>>>>>>
> >>>>>>>>>>> RF> filesystem used on top of md0 is xfs.
> >>>>>>>>>>>
> >>>>>>>>>>> RF> please advise what to do next and let me know if you need
> >>>>>>>>>>> RF> further information. really don't want to lose 3TB worth of
> >>>>>>>>>>> RF> data :-(
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> RF> tnx in advance.
> >>>>>>>>>>>
> >>>>>>>>>>> RF> --
> >>>>>>>>>>> RF> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> >>>>>>>>>>> RF> the body of a message to majordomo@xxxxxxxxxxxxxxx
> >>>>>>>>>>> RF> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> -------------------------------------------------------------
> >>>>>>>>>>>----------------- Unix gives you just enough rope to hang
> >>>>>>>>>>> yourself -- and then a couple of more feet, just to be sure.
> >>>>>>>>>>> (Eric Allman)
> >>>>>>>>>>> -------------------------------------------------------------
> >>>>>>>>>>>-----------------
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> ---------------------------------------------------------------
> >>>>>>>>>--------------- Unix gives you just enough rope to hang yourself
> >>>>>>>>> -- and then a couple of more feet, just to be sure.
> >>>>>>>>> (Eric Allman)
> >>>>>>>>> ---------------------------------------------------------------
> >>>>>>>>>---------------
> >>>>>>>
> >>>>>
> >>>
> >
> 


-- 
Thomas Fjellstrom
tfjellstrom@xxxxxxx
