MB> Is the disk being kicked always on the same port? (port 1 for example)

not sure how to interpret the syslog messages:

Nov 28 21:24:40 alfred kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Nov 28 21:24:40 alfred kernel: ata2.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0
Nov 28 21:24:40 alfred kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 28 21:24:40 alfred kernel: ata2.00: status: { DRDY }
Nov 28 21:24:40 alfred kernel: ata2: soft resetting link
Nov 28 21:24:41 alfred kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Nov 28 21:24:41 alfred kernel: ata2.00: configured for UDMA/133
Nov 28 21:24:41 alfred kernel: ata2: EH complete
Nov 28 21:24:41 alfred kernel: SCSI device sdb: 2930277168 512-byte hdwr sectors (1500302 MB)
Nov 28 21:24:41 alfred kernel: sdb: Write Protect is off
Nov 28 21:24:41 alfred kernel: SCSI device sdb: drive cache: write back
Nov 28 21:24:41 alfred smartd[2770]: Device: /dev/sdd, 1 Offline uncorrectable sectors

the smartd message for sdd appears frequently, that's why I replaced
the drive. the timeout above occurred 3 times within the last month for
sdb. guess you are right with either the port or the cable. tonight it
was sda, but I might have disturbed the cable without noticing when
replacing sdd.

MB> If so, then you may have a problem with that specific port. If it
MB> kicks disks randomly, and you're sure that your cables or disks are
MB> healthy, then it's probably time to change the motherboard.

I plan to move to the new atom/pinetrail mainboards as soon as they
are available in january. hope that solves this issue. but will check
the cable anyway.

tnx & cu

MB> Increasing the resync values of min will slow down your server if
MB> you're trying to access it during a resync.

MB> On Wed, Dec 23, 2009 at 6:13 PM, Rainer Fuegenstein
MB> <rfu@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> MB> I don't know why your array takes 3 days to resync.
>> MB> My array is 7TB in size (8x1TB @ RAID5) and it takes about 16 hours.
>>
>> that's definitely a big mystery. I put this to this list some time ago
>> when upgrading the same array from 4*750GB to 4*1500GB by replacing
>> one disk after the other and finally --growing the raid:
>>
>> 1st disk took just a few minutes
>> 2nd disk some hours
>> 3rd disk more than a day
>> 4th disk about 2+ days
>> --grow also took 2+ days
>>
>> MB> Check the value of this file:
>> MB> cat /proc/sys/dev/raid/speed_limit_max
>>
>> default values are:
>> [root@alfred cdrom]# cat /proc/sys/dev/raid/speed_limit_max
>> 200000
>> [root@alfred cdrom]# cat /proc/sys/dev/raid/speed_limit_min
>> 1000
>>
>> when resyncing (with these default values), the server becomes awfully
>> slow (streaming mp3 via smb suffers timeouts).
>>
>> mainboard is an Asus M2N with NFORCE-MCP61 chipset.
>>
>> this server started on an 800MHz asus board with 4*400 GB PATA disks
>> and had this one-disk-failure from the start (every few months). over the
>> years everything was replaced (power supply, mainboard, disks,
>> controller, pata to sata, ...) but it still kicks out disks (with the
>> current asus M2N board about every two to three weeks).
>>
>> must be cosmic radiation to blame ...
>>
>>
>> MB> Make it a high number so that when there's no process querying the
>> MB> disks, the resync process will go for the max speed.
>> echo '200000' >> /proc/sys/dev/raid/speed_limit_max
>> MB> (200 MB/s)
>>
>> MB> The file /proc/sys/dev/raid/speed_limit_min specifies the minimum
>> MB> speed at which the array should resync, even when there are other
>> MB> programs querying the disks.
>>
>> MB> Make sure you run the above changes just before you issue a resync.
>> MB> Changes are lost on reboot.
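A back-of-the-envelope check of the numbers discussed above, as a sketch rather than anything taken from the original exchange. It assumes md's documented convention that speed_limit_min and speed_limit_max are per-device rates in KiB/s, and it uses the per-member "Used Dev Size" of 1465135936 KiB from the mdadm -E output quoted in this thread. The function names are made up for illustration.

```python
# Sketch: relate md resync speed limits to expected resync duration.
# Assumption: limits are per-device rates in KiB/s (as described in md(4)).
# Function names are illustrative, not part of any tool.

USED_DEV_SIZE_KIB = 1465135936  # "Used Dev Size" per member, from mdadm -E


def resync_hours(dev_size_kib: int, speed_kib_s: float) -> float:
    """Hours needed to sweep one member at a constant per-device rate."""
    return dev_size_kib / speed_kib_s / 3600.0


def implied_speed_kib_s(dev_size_kib: int, days: float) -> float:
    """Average per-device rate implied by an observed resync duration."""
    return dev_size_kib / (days * 86400.0)


if __name__ == "__main__":
    at_max = resync_hours(USED_DEV_SIZE_KIB, 200000)
    at_min = resync_hours(USED_DEV_SIZE_KIB, 1000) / 24
    observed = implied_speed_kib_s(USED_DEV_SIZE_KIB, 3)
    print(f"at speed_limit_max (200000 KiB/s): {at_max:.1f} hours")
    print(f"at speed_limit_min (1000 KiB/s):   {at_min:.1f} days")
    print(f"a 3-day resync implies about       {observed:.0f} KiB/s")
```

Read this way, a three-day resync corresponds to an average of only a few MB/s per device, which is much closer to the speed_limit_min floor than to the speed_limit_max ceiling. That would be consistent with the resync being throttled or stalled by something (competing I/O, link resets), but the arithmetic alone cannot say which.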
>>
>> MB> On Wed, Dec 23, 2009 at 5:30 PM, Rainer Fuegenstein
>> MB> <rfu@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>> tnx for the info, in the meantime I did:
>>>>
>>>> mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
>>>>
>>>> there was no mdadm.conf file, so I had to specify all devices and do a
>>>> --force
>>>>
>>>>
>>>> # cat /proc/mdstat
>>>> Personalities : [raid6] [raid5] [raid4]
>>>> md0 : active raid5 sdb1[0] sdc1[3] sdd1[1]
>>>> 4395407808 blocks level 5, 64k chunk, algorithm 2 [4/3] [UU_U]
>>>>
>>>> unused devices: <none>
>>>>
>>>> md0 is up :-)
>>>>
>>>> I'm about to start backing up the most important data; when this is
>>>> done I assume the proper way to get back to normal again is:
>>>>
>>>> - remove the bad drive from the array: mdadm /dev/md0 -r /dev/sda1
>>>> - physically replace sda with a new drive
>>>> - add it back: mdadm /dev/md0 -a /dev/sda1
>>>> - wait three days for the sync to complete (and keep fingers crossed
>>>> that no other drive fails)
>>>>
>>>> big tnx!
>>>>
>>>>
>>>> MB> sda1 was the only affected member of the array so you should be able
>>>> MB> to force-assemble the raid5 array and run it in degraded mode.
>>>>
>>>> MB> mdadm -Af /dev/md0
>>>> MB> If that doesn't work for any reason, do this:
>>>> MB> mdadm -Af /dev/md0 /dev/sdb1 /dev/sdd1 /dev/sdc1
>>>>
>>>> MB> You can note the disk order from the output of mdadm -E
>>>>
>>>> MB> On Wed, Dec 23, 2009 at 5:02 PM, Rainer Fuegenstein
>>>> MB> <rfu@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> MB> My bad, run this: mdadm -E /dev/sd[a-z]1
>>>>>> should have figured this out myself (sorry; currently running in
>>>>>> panic mode ;-) )
>>>>>>
>>>>>> MB> 1 is the partition which most likely you added to the array rather
>>>>>> MB> than the whole disk (which is normal).
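A note on reading the mdadm -E output quoted in this thread: the "Events" counter is what distinguishes the stale member (sda1 at 119530) from the other three (130037), and the RaidDevice column of the "this" row gives the slot order MB refers to. The sketch below pulls both out of 0.90-format examine output. The sample text is abridged from the output in this thread; the parser and its function names are hypothetical helpers written for illustration, not part of mdadm.

```python
import re

# Abridged from the `mdadm -E /dev/sd[a-z]1` output quoted in this thread.
SAMPLE = """\
/dev/sda1:
    Update Time : Wed Dec 23 02:54:49 2009
         Events : 119530
this     2       8        1        2      active sync   /dev/sda1
/dev/sdb1:
    Update Time : Wed Dec 23 10:07:42 2009
         Events : 130037
this     0       8       17        0      active sync   /dev/sdb1
/dev/sdc1:
    Update Time : Wed Dec 23 10:07:42 2009
         Events : 130037
this     3       8       33        3      active sync   /dev/sdc1
/dev/sdd1:
    Update Time : Wed Dec 23 10:07:42 2009
         Events : 130037
this     1       8       49        1      active sync   /dev/sdd1
"""


def parse_examine(text):
    """Return {device: {"events": int, "slot": int}} from examine output.

    Assumes every member section contains an Events line and a "this" row,
    as the 0.90-format output in this thread does.
    """
    members, dev = {}, None
    for line in text.splitlines():
        header = re.match(r"^(/dev/\S+):\s*$", line)
        if header:
            dev = header.group(1)
            members[dev] = {}
            continue
        if dev is None:
            continue
        events = re.match(r"^\s*Events\s*:\s*(\d+)", line)
        if events:
            members[dev]["events"] = int(events.group(1))
        # Columns after "this": Number, Major, Minor, RaidDevice.
        this_row = re.match(r"^\s*this\s+\d+\s+\d+\s+\d+\s+(\d+)\s", line)
        if this_row:
            members[dev]["slot"] = int(this_row.group(1))
    return members


def split_fresh_stale(members):
    """Members at the highest event count (in slot order) vs. those behind it."""
    newest = max(m["events"] for m in members.values())
    fresh = sorted(
        (d for d, m in members.items() if m["events"] == newest),
        key=lambda d: members[d]["slot"],
    )
    stale = sorted(d for d, m in members.items() if m["events"] < newest)
    return fresh, stale


if __name__ == "__main__":
    fresh, stale = split_fresh_stale(parse_examine(SAMPLE))
    print("fresh, in slot order:", " ".join(fresh))
    print("stale:", " ".join(stale))
```

On the sample, the fresh members come out in the same order as MB's explicit command line (sdb1, sdd1, sdc1), with sda1 as the one left behind.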
>>>>>> >>>>>> # mdadm -E /dev/sd[a-z]1 >>>>>> /dev/sda1: >>>>>> Magic : a92b4efc >>>>>> Version : 0.90.00 >>>>>> UUID : 81833582:d651e953:48cc5797:38b256ea >>>>>> Creation Time : Mon Mar 31 13:30:45 2008 >>>>>> Raid Level : raid5 >>>>>> Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB) >>>>>> Array Size : 4395407808 (4191.79 GiB 4500.90 GB) >>>>>> Raid Devices : 4 >>>>>> Total Devices : 4 >>>>>> Preferred Minor : 0 >>>>>> >>>>>> Update Time : Wed Dec 23 02:54:49 2009 >>>>>> State : clean >>>>>> Active Devices : 4 >>>>>> Working Devices : 4 >>>>>> Failed Devices : 0 >>>>>> Spare Devices : 0 >>>>>> Checksum : 6cfa3a64 - correct >>>>>> Events : 119530 >>>>>> >>>>>> Layout : left-symmetric >>>>>> Chunk Size : 64K >>>>>> >>>>>> Number Major Minor RaidDevice State >>>>>> this 2 8 1 2 active sync /dev/sda1 >>>>>> >>>>>> 0 0 8 17 0 active sync /dev/sdb1 >>>>>> 1 1 8 49 1 active sync /dev/sdd1 >>>>>> 2 2 8 1 2 active sync /dev/sda1 >>>>>> 3 3 8 33 3 active sync /dev/sdc1 >>>>>> /dev/sdb1: >>>>>> Magic : a92b4efc >>>>>> Version : 0.90.00 >>>>>> UUID : 81833582:d651e953:48cc5797:38b256ea >>>>>> Creation Time : Mon Mar 31 13:30:45 2008 >>>>>> Raid Level : raid5 >>>>>> Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB) >>>>>> Array Size : 4395407808 (4191.79 GiB 4500.90 GB) >>>>>> Raid Devices : 4 >>>>>> Total Devices : 4 >>>>>> Preferred Minor : 0 >>>>>> >>>>>> Update Time : Wed Dec 23 10:07:42 2009 >>>>>> State : active >>>>>> Active Devices : 3 >>>>>> Working Devices : 3 >>>>>> Failed Devices : 1 >>>>>> Spare Devices : 0 >>>>>> Checksum : 6cf8f610 - correct >>>>>> Events : 130037 >>>>>> >>>>>> Layout : left-symmetric >>>>>> Chunk Size : 64K >>>>>> >>>>>> Number Major Minor RaidDevice State >>>>>> this 0 8 17 0 active sync /dev/sdb1 >>>>>> >>>>>> 0 0 8 17 0 active sync /dev/sdb1 >>>>>> 1 1 8 49 1 active sync /dev/sdd1 >>>>>> 2 2 0 0 2 faulty removed >>>>>> 3 3 8 33 3 active sync /dev/sdc1 >>>>>> /dev/sdc1: >>>>>> Magic : a92b4efc >>>>>> Version : 0.90.00 >>>>>> UUID : 
81833582:d651e953:48cc5797:38b256ea >>>>>> Creation Time : Mon Mar 31 13:30:45 2008 >>>>>> Raid Level : raid5 >>>>>> Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB) >>>>>> Array Size : 4395407808 (4191.79 GiB 4500.90 GB) >>>>>> Raid Devices : 4 >>>>>> Total Devices : 4 >>>>>> Preferred Minor : 0 >>>>>> >>>>>> Update Time : Wed Dec 23 10:07:42 2009 >>>>>> State : active >>>>>> Active Devices : 3 >>>>>> Working Devices : 3 >>>>>> Failed Devices : 1 >>>>>> Spare Devices : 0 >>>>>> Checksum : 6cf8f626 - correct >>>>>> Events : 130037 >>>>>> >>>>>> Layout : left-symmetric >>>>>> Chunk Size : 64K >>>>>> >>>>>> Number Major Minor RaidDevice State >>>>>> this 3 8 33 3 active sync /dev/sdc1 >>>>>> >>>>>> 0 0 8 17 0 active sync /dev/sdb1 >>>>>> 1 1 8 49 1 active sync /dev/sdd1 >>>>>> 2 2 0 0 2 faulty removed >>>>>> 3 3 8 33 3 active sync /dev/sdc1 >>>>>> /dev/sdd1: >>>>>> Magic : a92b4efc >>>>>> Version : 0.90.00 >>>>>> UUID : 81833582:d651e953:48cc5797:38b256ea >>>>>> Creation Time : Mon Mar 31 13:30:45 2008 >>>>>> Raid Level : raid5 >>>>>> Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB) >>>>>> Array Size : 4395407808 (4191.79 GiB 4500.90 GB) >>>>>> Raid Devices : 4 >>>>>> Total Devices : 4 >>>>>> Preferred Minor : 0 >>>>>> >>>>>> Update Time : Wed Dec 23 10:07:42 2009 >>>>>> State : active >>>>>> Active Devices : 3 >>>>>> Working Devices : 3 >>>>>> Failed Devices : 1 >>>>>> Spare Devices : 0 >>>>>> Checksum : 6cf8f632 - correct >>>>>> Events : 130037 >>>>>> >>>>>> Layout : left-symmetric >>>>>> Chunk Size : 64K >>>>>> >>>>>> Number Major Minor RaidDevice State >>>>>> this 1 8 49 1 active sync /dev/sdd1 >>>>>> >>>>>> 0 0 8 17 0 active sync /dev/sdb1 >>>>>> 1 1 8 49 1 active sync /dev/sdd1 >>>>>> 2 2 0 0 2 faulty removed >>>>>> 3 3 8 33 3 active sync /dev/sdc1 >>>>>> [root@alfred log]# >>>>>> >>>>>> MB> You've included the smart report of one disk only. 
I suggest you look
>>>>>> MB> at the other disks as well and make sure that they're not reporting
>>>>>> MB> any errors. Also, keep in mind that you should run smart test
>>>>>> MB> periodically (can be configured) and that if you haven't run any test
>>>>>> MB> before, you have to run a long or offline test before making sure that
>>>>>> MB> you don't have bad sectors.
>>>>>>
>>>>>> tnx for the hint, will do that as soon as I got my data back (if ever
>>>>>> ...)
>>>>>>
>>>>>>
>>>>>> MB> On Wed, Dec 23, 2009 at 4:44 PM, Rainer Fuegenstein
>>>>>> MB> <rfu@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> MB> Give the output of these:
>>>>>>>> MB> mdadm -E /dev/sd[a-z]
>>>>>>>>
>>>>>>>> ]# mdadm -E /dev/sd[a-z]
>>>>>>>> mdadm: No md superblock detected on /dev/sda.
>>>>>>>> mdadm: No md superblock detected on /dev/sdb.
>>>>>>>> mdadm: No md superblock detected on /dev/sdc.
>>>>>>>> mdadm: No md superblock detected on /dev/sdd.
>>>>>>>>
>>>>>>>> I assume that's not a good sign ?!
>>>>>>>>
>>>>>>>> sda was powered on and running after the reboot, a smartctl short test
>>>>>>>> revealed no errors and smartctl -a also looks unsuspicious (see
>>>>>>>> below). the drives are rather new.
>>>>>>>>
>>>>>>>> guess it's more likely to be either a problem of the power supply
>>>>>>>> (400W) or communication between controller and disk.
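One way to put the "always the same port?" question to the logs is to count libata exception lines per ataN port. This is only a sketch: the sample lines are copied from the kernel messages quoted in this thread, the regular expression is an assumption about how such lines look on this kernel, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Lines copied from the kernel messages quoted in this thread.
SAMPLE_LOG = """\
Nov 28 21:24:40 alfred kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Nov 28 21:24:40 alfred kernel: ata2: soft resetting link
Dec 23 02:55:40 alfred kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 23 02:56:50 alfred kernel: ata1: reset failed, giving up
"""

# Matches "ataN.MM: exception" and "ataN: exception"; captures the port name.
EXCEPTION = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: exception ")


def exceptions_per_port(lines):
    """Count libata exception lines per port (ata1, ata2, ...)."""
    counts = Counter()
    for line in lines:
        match = EXCEPTION.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for port, count in sorted(exceptions_per_port(SAMPLE_LOG.splitlines()).items()):
        print(port, count)
```

On a live system the same function could be fed the lines of the syslog file instead of the sample. Errors spread across several ports would point away from a single bad port or cable and more towards something shared, such as the controller or the power supply, though a count like this is a hint at best.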
>>>>>>>> >>>>>>>> /dev/sdd (before it was replaced) reported the following: >>>>>>>> >>>>>>>> Dec 20 07:18:54 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors >>>>>>>> Dec 20 07:48:53 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors >>>>>>>> Dec 20 08:18:54 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors >>>>>>>> Dec 20 08:48:55 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors >>>>>>>> Dec 20 09:18:53 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors >>>>>>>> Dec 20 09:48:58 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors >>>>>>>> Dec 20 10:19:01 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors >>>>>>>> Dec 20 10:48:54 alfred smartd[2705]: Device: /dev/sdd, 1 Offline uncorrectable sectors >>>>>>>> >>>>>>>> (what triggered a re-sync of the array) >>>>>>>> >>>>>>>> >>>>>>>> # smartctl -a /dev/sda >>>>>>>> smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen >>>>>>>> Home page is http://smartmontools.sourceforge.net/ >>>>>>>> >>>>>>>> === START OF INFORMATION SECTION === >>>>>>>> Device Model: WDC WD15EADS-00R6B0 >>>>>>>> Serial Number: WD-WCAUP0017818 >>>>>>>> Firmware Version: 01.00A01 >>>>>>>> User Capacity: 1,500,301,910,016 bytes >>>>>>>> Device is: Not in smartctl database [for details use: -P showall] >>>>>>>> ATA Version is: 8 >>>>>>>> ATA Standard is: Exact ATA specification draft version not indicated >>>>>>>> Local Time is: Wed Dec 23 14:40:46 2009 CET >>>>>>>> SMART support is: Available - device has SMART capability. >>>>>>>> SMART support is: Enabled >>>>>>>> >>>>>>>> === START OF READ SMART DATA SECTION === >>>>>>>> SMART overall-health self-assessment test result: PASSED >>>>>>>> >>>>>>>> General SMART Values: >>>>>>>> Offline data collection status: (0x82) Offline data collection activity >>>>>>>> was completed without error. >>>>>>>> Auto Offline Data Collection: Enabled. 
>>>>>>>> Self-test execution status: ( 0) The previous self-test routine completed >>>>>>>> without error or no self-test has ever >>>>>>>> been run. >>>>>>>> Total time to complete Offline >>>>>>>> data collection: (40800) seconds. >>>>>>>> Offline data collection >>>>>>>> capabilities: (0x7b) SMART execute Offline immediate. >>>>>>>> Auto Offline data collection on/off support. >>>>>>>> Suspend Offline collection upon new >>>>>>>> command. >>>>>>>> Offline surface scan supported. >>>>>>>> Self-test supported. >>>>>>>> Conveyance Self-test supported. >>>>>>>> Selective Self-test supported. >>>>>>>> SMART capabilities: (0x0003) Saves SMART data before entering >>>>>>>> power-saving mode. >>>>>>>> Supports SMART auto save timer. >>>>>>>> Error logging capability: (0x01) Error logging supported. >>>>>>>> General Purpose Logging supported. >>>>>>>> Short self-test routine >>>>>>>> recommended polling time: ( 2) minutes. >>>>>>>> Extended self-test routine >>>>>>>> recommended polling time: ( 255) minutes. >>>>>>>> Conveyance self-test routine >>>>>>>> recommended polling time: ( 5) minutes. >>>>>>>> SCT capabilities: (0x303f) SCT Status supported. >>>>>>>> SCT Feature Control supported. >>>>>>>> SCT Data Table supported. 
>>>>>>>> >>>>>>>> SMART Attributes Data Structure revision number: 16 >>>>>>>> Vendor Specific SMART Attributes with Thresholds: >>>>>>>> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE >>>>>>>> 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0 >>>>>>>> 3 Spin_Up_Time 0x0027 177 145 021 Pre-fail Always - 8133 >>>>>>>> 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 15 >>>>>>>> 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0 >>>>>>>> 7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0 >>>>>>>> 9 Power_On_Hours 0x0032 093 093 000 Old_age Always - 5272 >>>>>>>> 10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0 >>>>>>>> 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0 >>>>>>>> 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 14 >>>>>>>> 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 2 >>>>>>>> 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 13 >>>>>>>> 194 Temperature_Celsius 0x0022 125 109 000 Old_age Always - 27 >>>>>>>> 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0 >>>>>>>> 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0 >>>>>>>> 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0 >>>>>>>> 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0 >>>>>>>> 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0 >>>>>>>> >>>>>>>> SMART Error Log Version: 1 >>>>>>>> No Errors Logged >>>>>>>> >>>>>>>> SMART Self-test log structure revision number 1 >>>>>>>> Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error >>>>>>>> # 1 Short offline Completed without error 00% 5272 - >>>>>>>> >>>>>>>> SMART Selective self-test log data structure revision number 1 >>>>>>>> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS >>>>>>>> 1 0 0 Not_testing >>>>>>>> 2 0 0 Not_testing >>>>>>>> 3 0 0 Not_testing >>>>>>>> 4 0 0 Not_testing >>>>>>>> 5 0 0 Not_testing >>>>>>>> Selective self-test flags 
(0x0):
>>>>>>>> After scanning selected spans, do NOT read-scan remainder of disk.
>>>>>>>> If Selective self-test is pending on power-up, resume after 0 minute delay.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> MB> From the errors you show, it seems like one of the disks is dead (sda)
>>>>>>>> MB> or dying. It could be just a bad PCB (the controller board of the
>>>>>>>> MB> disk) as it refuses to return SMART data, so you might be able to
>>>>>>>> MB> rescue data by changing the PCB, if it's that important to have that
>>>>>>>> MB> disk.
>>>>>>>>
>>>>>>>> MB> As for the array, you can run a degraded array by force assembling it:
>>>>>>>> MB> mdadm -Af /dev/md0
>>>>>>>> MB> In the command above, mdadm will search on existing disks and
>>>>>>>> MB> partitions, which of them belongs to an array and assemble that array,
>>>>>>>> MB> if possible.
>>>>>>>>
>>>>>>>> MB> I also suggest you install smartmontools package and run smartctl -a
>>>>>>>> MB> /dev/sd[a-z] and see the report for each disk to make sure you don't
>>>>>>>> MB> have bad sectors or bad cables (CRC/ATA read errors) on any of the
>>>>>>>> MB> disks.
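For the "bad sectors or bad cables" check, the attributes usually looked at in the smartctl table are 5, 197 and 198 (sector problems) and 199 (CRC errors on the link, often a cable). The sketch below flags any of those with a non-zero raw value. The sample rows are taken from the smartctl -a report for sda quoted in this thread; the choice of attributes and the function name are assumptions made for illustration, not a smartmontools feature.

```python
# Attributes to watch: sector trouble (5, 197, 198) and link CRC errors (199).
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    199: "UDMA_CRC_Error_Count",
}

# Rows taken from the `smartctl -a /dev/sda` report quoted in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       5272
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""


def suspicious_attributes(table_text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    hits = {}
    for line in table_text.splitlines():
        fields = line.split()
        # A data row has 10 columns and starts with the numeric attribute ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED and raw.isdigit() and int(raw) > 0:
            hits[name] = int(raw)
    return hits


if __name__ == "__main__":
    print(suspicious_attributes(SAMPLE) or "nothing flagged")
```

The sda rows above flag nothing, which matches the "looks unsuspicious" reading of that report. A clean table does not rule out a link or power problem, since the timeouts in the kernel log happen below the level these counters record.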
>>>>>>>> >>>>>>>> MB> On Wed, Dec 23, 2009 at 3:50 PM, Rainer Fuegenstein >>>>>>>> MB> <rfu@xxxxxxxxxxxxxxxxxxxxxxxx> wrote: >>>>>>>>>> addendum: when going through the logs I found the reason: >>>>>>>>>> >>>>>>>>>> Dec 23 02:55:40 alfred kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen >>>>>>>>>> Dec 23 02:55:40 alfred kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 >>>>>>>>>> Dec 23 02:55:40 alfred kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) >>>>>>>>>> Dec 23 02:55:40 alfred kernel: ata1.00: status: { DRDY } >>>>>>>>>> Dec 23 02:55:45 alfred kernel: ata1: link is slow to respond, please be patient (ready=0) >>>>>>>>>> Dec 23 02:55:50 alfred kernel: ata1: device not ready (errno=-16), forcing hardreset >>>>>>>>>> Dec 23 02:55:50 alfred kernel: ata1: soft resetting link >>>>>>>>>> Dec 23 02:55:55 alfred kernel: ata1: link is slow to respond, please be patient (ready=0) >>>>>>>>>> Dec 23 02:56:00 alfred kernel: ata1: SRST failed (errno=-16) >>>>>>>>>> Dec 23 02:56:00 alfred kernel: ata1: soft resetting link >>>>>>>>>> Dec 23 02:56:05 alfred kernel: ata1: link is slow to respond, please be patient (ready=0) >>>>>>>>>> Dec 23 02:56:10 alfred kernel: ata1: SRST failed (errno=-16) >>>>>>>>>> Dec 23 02:56:10 alfred kernel: ata1: soft resetting link >>>>>>>>>> Dec 23 02:56:15 alfred kernel: ata1: link is slow to respond, please be patient (ready=0) >>>>>>>>>> Dec 23 02:56:45 alfred kernel: ata1: SRST failed (errno=-16) >>>>>>>>>> Dec 23 02:56:45 alfred kernel: ata1: limiting SATA link speed to 1.5 Gbps >>>>>>>>>> Dec 23 02:56:45 alfred kernel: ata1: soft resetting link >>>>>>>>>> Dec 23 02:56:50 alfred kernel: ata1: SRST failed (errno=-16) >>>>>>>>>> Dec 23 02:56:50 alfred kernel: ata1: reset failed, giving up >>>>>>>>>> Dec 23 02:56:50 alfred kernel: ata1.00: disabled >>>>>>>>>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: timing out command, waited 30s >>>>>>>>>> Dec 23 02:56:50 alfred kernel: ata1: EH 
complete >>>>>>>>>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: SCSI error: return code = 0x00040000 >>>>>>>>>> Dec 23 02:56:50 alfred kernel: end_request: I/O error, dev sda, sector 1244700223 >>>>>>>>>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: SCSI error: return code = 0x00040000 >>>>>>>>>> Dec 23 02:56:50 alfred kernel: end_request: I/O error, dev sda, sector 1554309191 >>>>>>>>>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: SCSI error: return code = 0x00040000 >>>>>>>>>> Dec 23 02:56:50 alfred kernel: end_request: I/O error, dev sda, sector 1554309439 >>>>>>>>>> Dec 23 02:56:50 alfred kernel: sd 0:0:0:0: SCSI error: return code = 0x00040000 >>>>>>>>>> Dec 23 02:56:50 alfred kernel: end_request: I/O error, dev sda, sector 572721343 >>>>>>>>>> Dec 23 02:56:50 alfred kernel: raid5: Disk failure on sda1, disabling device. Operation continuing on 3 devices >>>>>>>>>> Dec 23 02:56:50 alfred kernel: RAID5 conf printout: >>>>>>>>>> Dec 23 02:56:50 alfred kernel: --- rd:4 wd:3 fd:1 >>>>>>>>>> Dec 23 02:56:50 alfred kernel: disk 0, o:1, dev:sdb1 >>>>>>>>>> Dec 23 02:56:50 alfred kernel: disk 1, o:1, dev:sdd1 >>>>>>>>>> Dec 23 02:56:50 alfred kernel: disk 2, o:0, dev:sda1 >>>>>>>>>> Dec 23 02:56:50 alfred kernel: disk 3, o:1, dev:sdc1 >>>>>>>>>> Dec 23 02:56:50 alfred kernel: RAID5 conf printout: >>>>>>>>>> Dec 23 02:56:50 alfred kernel: --- rd:4 wd:3 fd:1 >>>>>>>>>> Dec 23 02:56:50 alfred kernel: disk 0, o:1, dev:sdb1 >>>>>>>>>> Dec 23 02:56:50 alfred kernel: disk 1, o:1, dev:sdd1 >>>>>>>>>> Dec 23 02:56:50 alfred kernel: disk 3, o:1, dev:sdc1 >>>>>>>>>> Dec 23 03:22:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check >>>>>>>>>> Dec 23 03:22:57 alfred smartd[2692]: Sending warning via mail to root ... 
>>>>>>>>>> Dec 23 03:22:58 alfred smartd[2692]: Warning via mail to root: successful >>>>>>>>>> Dec 23 03:22:58 alfred smartd[2692]: Device: /dev/sda, failed to read SMART Attribute Data >>>>>>>>>> Dec 23 03:22:58 alfred smartd[2692]: Sending warning via mail to root ... >>>>>>>>>> Dec 23 03:22:58 alfred smartd[2692]: Warning via mail to root: successful >>>>>>>>>> Dec 23 03:52:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check >>>>>>>>>> Dec 23 03:52:57 alfred smartd[2692]: Device: /dev/sda, failed to read SMART Attribute Data >>>>>>>>>> Dec 23 04:22:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check >>>>>>>>>> Dec 23 04:22:57 alfred smartd[2692]: Device: /dev/sda, failed to read SMART Attribute Data >>>>>>>>>> Dec 23 04:52:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check >>>>>>>>>> [...] >>>>>>>>>> Dec 23 09:52:57 alfred smartd[2692]: Device: /dev/sda, not capable of SMART self-check >>>>>>>>>> Dec 23 09:52:57 alfred smartd[2692]: Device: /dev/sda, failed to read SMART Attribute Data >>>>>>>>>> (crash here) >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> RF> hi, >>>>>>>>>> >>>>>>>>>> RF> got a "nice" early christmas present this morning: after a crash, the raid5 >>>>>>>>>> RF> (consisting of 4*1.5TB WD caviar green SATA disks) won't start :-( >>>>>>>>>> >>>>>>>>>> RF> the history: >>>>>>>>>> RF> sometimes, the raid kicked out one disk, started a resync (which >>>>>>>>>> RF> lasted for about 3 days) and was fine after that. a few days ago I >>>>>>>>>> RF> replaced drive sdd (which seemed to cause the troubles) and synced the >>>>>>>>>> RF> raid again which finished yesterday in the early afternoon. 
at 10am >>>>>>>>>> RF> today the system crashed and the raid won't start: >>>>>>>>>> >>>>>>>>>> RF> OS is Centos 5 >>>>>>>>>> RF> mdadm - v2.6.9 - 10th March 2009 >>>>>>>>>> RF> Linux alfred 2.6.18-164.6.1.el5xen #1 SMP Tue Nov 3 17:53:47 EST 2009 i686 athlon i386 GNU/Linux >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: Autodetecting RAID arrays. >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: autorun ... >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: considering sdd1 ... >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: adding sdd1 ... >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: adding sdc1 ... >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: adding sdb1 ... >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: adding sda1 ... >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: created md0 >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: bind<sda1> >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: bind<sdb1> >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: bind<sdc1> >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: bind<sdd1> >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: running: <sdd1><sdc1><sdb1><sda1> >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: kicking non-fresh sda1 from array! 
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: unbind<sda1> >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: export_rdev(sda1) >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: md0: raid array is not clean -- starting background reconstruction >>>>>>>>>> RF> (no reconstruction is actually started, disks are idle) >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid5: automatically using best checksumming function: pIII_sse >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: pIII_sse : 7085.000 MB/sec >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid5: using function: pIII_sse (7085.000 MB/sec) >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: int32x1 896 MB/s >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: int32x2 972 MB/s >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: int32x4 893 MB/s >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: int32x8 934 MB/s >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: mmxx1 1845 MB/s >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: mmxx2 3250 MB/s >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: sse1x1 1799 MB/s >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: sse1x2 3067 MB/s >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: sse2x1 2980 MB/s >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: sse2x2 4015 MB/s >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid6: using algorithm sse2x2 (4015 MB/s) >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: raid6 personality registered for level 6 >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: raid5 personality registered for level 5 >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: raid4 personality registered for level 4 >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid5: device sdd1 operational as raid disk 1 >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid5: device sdc1 operational as raid disk 3 >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid5: device sdb1 operational as raid disk 0 >>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid5: cannot start 
dirty degraded array for md0
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: RAID5 conf printout:
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: --- rd:4 wd:3 fd:1
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: disk 0, o:1, dev:sdb1
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: disk 1, o:1, dev:sdd1
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: disk 3, o:1, dev:sdc1
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: raid5: failed to run raid set md0
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: pers->run() failed ...
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: do_md_run() returned -5
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: md0 stopped.
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: unbind<sdd1>
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: export_rdev(sdd1)
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: unbind<sdc1>
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: export_rdev(sdc1)
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: unbind<sdb1>
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: export_rdev(sdb1)
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: md: ... autorun DONE.
>>>>>>>>>> RF> Dec 23 12:30:19 alfred kernel: device-mapper: multipath: version 1.0.5 loaded
>>>>>>>>>>
>>>>>>>>>> RF> # cat /proc/mdstat
>>>>>>>>>> RF> Personalities : [raid6] [raid5] [raid4]
>>>>>>>>>> RF> unused devices: <none>
>>>>>>>>>>
>>>>>>>>>> RF> filesystem used on top of md0 is xfs.
>>>>>>>>>>
>>>>>>>>>> RF> please advise what to do next and let me know if you need further
>>>>>>>>>> RF> information. really don't want to lose 3TB worth of data :-(
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> RF> tnx in advance.

------------------------------------------------------------------------------
Unix gives you just enough rope to hang yourself -- and then a couple of more
feet, just to be sure.
(Eric Allman)
------------------------------------------------------------------------------

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html