On 31 July 2011 19:05, Mathias Burén <mathias.buren@xxxxxxxxx> wrote: > On 31 July 2011 18:59, Mathias Burén <mathias.buren@xxxxxxxxx> wrote: >> On 31 July 2011 14:05, Mathias Burén <mathias.buren@xxxxxxxxx> wrote: >>> Hi list, >>> >>> Here's the output of my weekly script: >>> >>> DEV EVENTS REALL PEND UNCORR CRC RAW ZONE END >>> sdb1 6158767 0 0 0 2 0 0 >>> sdc1 6158767 0 0 0 0 0 0 >>> sdd1 6158767 0 0 0 0 0 0 >>> sde1 6158767 0 0 0 0 0 1 >>> sdf1 6158767 0 0 0 0 47 6 >>> sdg1 6158767 0 0 0 0 0 0 >>> sdh1 6158767 0 6 0 0 340 3 >>> >>> >>> Personalities : [raid6] [raid5] [raid4] >>> md0 : active raid6 sdf1[5] sdh1[6] sdg1[0] sde1[7] sdc1[3] sdd1[4] sdb1[1] >>> 9751756800 blocks super 1.2 level 6, 64k chunk, algorithm 2 >>> [7/7] [UUUUUUU] >>> >>> unused devices: <none> >>> >>> >>> /dev/md0: >>> Version : 1.2 >>> Creation Time : Tue Oct 19 08:58:41 2010 >>> Raid Level : raid6 >>> Array Size : 9751756800 (9300.00 GiB 9985.80 GB) >>> Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB) >>> Raid Devices : 7 >>> Total Devices : 7 >>> Persistence : Superblock is persistent >>> >>> Update Time : Sun Jul 31 09:50:43 2011 >>> State : clean >>> Active Devices : 7 >>> Working Devices : 7 >>> Failed Devices : 0 >>> Spare Devices : 0 >>> >>> Layout : left-symmetric >>> Chunk Size : 64K >>> >>> Name : ion:0 (local to host ion) >>> UUID : e6595c64:b3ae90b3:f01133ac:3f402d20 >>> Events : 6158767 >>> >>> Number Major Minor RaidDevice State >>> 0 8 97 0 active sync /dev/sdg1 >>> 1 8 17 1 active sync /dev/sdb1 >>> 4 8 49 2 active sync /dev/sdd1 >>> 3 8 33 3 active sync /dev/sdc1 >>> 5 8 81 4 active sync /dev/sdf1 >>> 6 8 113 5 active sync /dev/sdh1 >>> 7 8 65 6 active sync /dev/sde1 >>> >>> Here's the SMART data for sdh: >>> >>> >>> smartctl 5.41 2011-06-09 r3365 [x86_64-linux-2.6.39-ck] (local build) >>> Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net >>> >>> === START OF INFORMATION SECTION === >>> Model Family: SAMSUNG SpinPoint F4 EG (AFT) >>> Device 
Model: SAMSUNG HD204UI >>> Serial Number: S2HGJ1RZ800850 >>> LU WWN Device Id: 5 0024e9 003f1ebc9 >>> Firmware Version: 1AQ10003 >>> User Capacity: 2,000,398,934,016 bytes [2.00 TB] >>> Sector Size: 512 bytes logical/physical >>> Device is: In smartctl database [for details use: -P show] >>> ATA Version is: 8 >>> ATA Standard is: ATA-8-ACS revision 6 >>> Local Time is: Sun Jul 31 14:03:32 2011 IST >>> >>> ==> WARNING: Using smartmontools or hdparm with this >>> drive may result in data loss due to a firmware bug. >>> ****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ****** >>> Buggy and fixed firmware report same version number! >>> See the following web pages for details: >>> http://www.samsung.com/global/business/hdd/faqView.do?b2b_bbs_msg_id=386 >>> http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks >>> >>> SMART support is: Available - device has SMART capability. >>> SMART support is: Enabled >>> >>> === START OF READ SMART DATA SECTION === >>> SMART overall-health self-assessment test result: PASSED >>> >>> General SMART Values: >>> Offline data collection status: (0x82) Offline data collection activity >>> was completed without error. >>> Auto Offline Data Collection: Enabled. >>> Self-test execution status: ( 37) The self-test routine was interrupted >>> by the host with a hard or soft reset. >>> Total time to complete Offline >>> data collection: (20640) seconds. >>> Offline data collection >>> capabilities: (0x5b) SMART execute Offline immediate. >>> Auto Offline data collection on/off support. >>> Suspend Offline collection upon new >>> command. >>> Offline surface scan supported. >>> Self-test supported. >>> No Conveyance Self-test supported. >>> Selective Self-test supported. >>> SMART capabilities: (0x0003) Saves SMART data before entering >>> power-saving mode. >>> Supports SMART auto save timer. >>> Error logging capability: (0x01) Error logging supported. >>> General Purpose Logging supported. 
>>> Short self-test routine >>> recommended polling time: ( 2) minutes. >>> Extended self-test routine >>> recommended polling time: ( 255) minutes. >>> SCT capabilities: (0x003f) SCT Status supported. >>> SCT Error Recovery Control supported. >>> SCT Feature Control supported. >>> SCT Data Table supported. >>> >>> SMART Attributes Data Structure revision number: 16 >>> Vendor Specific SMART Attributes with Thresholds: >>> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE >>> UPDATED WHEN_FAILED RAW_VALUE >>> 1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail >>> Always - 340 >>> 2 Throughput_Performance 0x0026 055 053 000 Old_age >>> Always - 18989 >>> 3 Spin_Up_Time 0x0023 067 044 025 Pre-fail >>> Always - 10165 >>> 4 Start_Stop_Count 0x0032 100 100 000 Old_age >>> Always - 18 >>> 5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail >>> Always - 0 >>> 7 Seek_Error_Rate 0x002e 252 252 051 Old_age >>> Always - 0 >>> 8 Seek_Time_Performance 0x0024 252 252 015 Old_age >>> Offline - 0 >>> 9 Power_On_Hours 0x0032 100 100 000 Old_age >>> Always - 6447 >>> 10 Spin_Retry_Count 0x0032 252 252 051 Old_age >>> Always - 0 >>> 11 Calibration_Retry_Count 0x0032 252 252 000 Old_age >>> Always - 0 >>> 12 Power_Cycle_Count 0x0032 100 100 000 Old_age >>> Always - 20 >>> 181 Program_Fail_Cnt_Total 0x0022 100 100 000 Old_age >>> Always - 10117271 >>> 191 G-Sense_Error_Rate 0x0022 100 100 000 Old_age >>> Always - 1 >>> 192 Power-Off_Retract_Count 0x0022 252 252 000 Old_age >>> Always - 0 >>> 194 Temperature_Celsius 0x0002 064 057 000 Old_age >>> Always - 35 (Min/Max 16/43) >>> 195 Hardware_ECC_Recovered 0x003a 100 100 000 Old_age >>> Always - 0 >>> 196 Reallocated_Event_Count 0x0032 252 252 000 Old_age >>> Always - 0 >>> 197 Current_Pending_Sector 0x0032 100 100 000 Old_age >>> Always - 6 >>> 198 Offline_Uncorrectable 0x0030 252 252 000 Old_age >>> Offline - 0 >>> 199 UDMA_CRC_Error_Count 0x0036 200 200 000 Old_age >>> Always - 0 >>> 200 Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age 
>>> Always - 3 >>> 223 Load_Retry_Count 0x0032 252 252 000 Old_age >>> Always - 0 >>> 225 Load_Cycle_Count 0x0032 100 100 000 Old_age >>> Always - 21 >>> >>> SMART Error Log Version: 1 >>> No Errors Logged >>> >>> SMART Self-test log structure revision number 1 >>> Num Test_Description Status Remaining >>> LifeTime(hours) LBA_of_first_error >>> # 1 Extended offline Interrupted (host reset) 50% 6408 - >>> # 2 Extended offline Completed without error 00% 6317 - >>> # 3 Extended offline Completed without error 00% 6260 - >>> # 4 Extended offline Completed without error 00% 6232 - >>> # 5 Extended offline Completed without error 00% 6170 - >>> # 6 Extended offline Completed without error 00% 6064 - >>> # 7 Extended offline Completed without error 00% 6029 - >>> # 8 Extended offline Completed without error 00% 5898 - >>> # 9 Extended offline Aborted by host 60% 5893 - >>> #10 Extended offline Completed without error 00% 5728 - >>> #11 Extended offline Completed without error 00% 5706 - >>> #12 Extended offline Interrupted (host reset) 40% 5701 - >>> #13 Extended offline Interrupted (host reset) 90% 5666 - >>> #14 Extended offline Completed without error 00% 5560 - >>> #15 Extended offline Completed without error 00% 5527 - >>> #16 Extended offline Completed without error 00% 5392 - >>> #17 Extended offline Completed without error 00% 5357 - >>> #18 Extended offline Completed without error 00% 5250 - >>> #19 Extended offline Completed without error 00% 4272 - >>> #20 Extended offline Completed without error 00% 4017 - >>> #21 Extended offline Completed without error 00% 3935 - >>> >>> Note: selective self-test log revision number (0) not 1 implies that >>> no selective self-test has ever been run >>> SMART Selective self-test log data structure revision number 0 >>> Note: revision number not 1 implies that no selective self-test has >>> ever been run >>> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS >>> 1 0 0 Interrupted [50% left] (0-65535) >>> 2 0 0 Not_testing >>> 3 0 0 
Not_testing >>> 4 0 0 Not_testing >>> 5 0 0 Not_testing >>> Selective self-test flags (0x0): >>> After scanning selected spans, do NOT read-scan remainder of disk. >>> If Selective self-test is pending on power-up, resume after 0 minute delay. >>> >>> >>> It has 6 pending sectors. Why are they not reallocated? Can I force >>> this somehow? (a scrub did not reallocate them) Is this enough to >>> replace the HDD? >>> >>> Thanks, >>> Mathias >>> >> >> Uh oh, I did another scrub, and here's the status: >> >> DEV EVENTS REALL PEND UNCORR CRC RAW ZONE END >> sdb1 6158767 0 0 0 2 0 0 >> sdc1 6158767 0 0 0 0 0 0 >> sdd1 6158767 0 0 0 0 0 0 >> sde1 6158767 0 0 0 0 0 1 >> sdf1 6158768 0 0 0 0 47 6 >> sdg1 6158767 0 0 0 0 0 0 >> sdh1 6158767 0 8 1 0 341 3 >> >> >> Personalities : [raid6] [raid5] [raid4] >> md0 : active raid6 sdh1[6] sdg1[0] sdf1[5] sde1[7] sdd1[4] sdb1[1] sdc1[3] >> 9751756800 blocks super 1.2 level 6, 64k chunk, algorithm 2 >> [7/7] [UUUUUUU] >> >> unused devices: <none> >> >> >> /dev/md0: >> Version : 1.2 >> Creation Time : Tue Oct 19 08:58:41 2010 >> Raid Level : raid6 >> Array Size : 9751756800 (9300.00 GiB 9985.80 GB) >> Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB) >> Raid Devices : 7 >> Total Devices : 7 >> Persistence : Superblock is persistent >> >> Update Time : Sun Jul 31 18:51:58 2011 >> State : clean >> Active Devices : 7 >> Working Devices : 7 >> Failed Devices : 0 >> Spare Devices : 0 >> >> Layout : left-symmetric >> Chunk Size : 64K >> >> Name : ion:0 (local to host ion) >> UUID : e6595c64:b3ae90b3:f01133ac:3f402d20 >> Events : 6158767 >> >> Number Major Minor RaidDevice State >> 0 8 97 0 active sync /dev/sdg1 >> 1 8 17 1 active sync /dev/sdb1 >> 4 8 49 2 active sync /dev/sdd1 >> 3 8 33 3 active sync /dev/sdc1 >> 5 8 81 4 active sync /dev/sdf1 >> 6 8 113 5 active sync /dev/sdh1 >> 7 8 65 6 active sync /dev/sde1 >> >> smartctl 5.41 2011-06-09 r3365 [x86_64-linux-2.6.39-ck] (local build) >> Copyright (C) 2002-11 by Bruce Allen, 
http://smartmontools.sourceforge.net >> >> === START OF INFORMATION SECTION === >> Model Family: SAMSUNG SpinPoint F4 EG (AFT) >> Device Model: SAMSUNG HD204UI >> Serial Number: S2HGJ1RZ800850 >> LU WWN Device Id: 5 0024e9 003f1ebc9 >> Firmware Version: 1AQ10003 >> User Capacity: 2,000,398,934,016 bytes [2.00 TB] >> Sector Size: 512 bytes logical/physical >> Device is: In smartctl database [for details use: -P show] >> ATA Version is: 8 >> ATA Standard is: ATA-8-ACS revision 6 >> Local Time is: Sun Jul 31 18:51:59 2011 IST >> >> ==> WARNING: Using smartmontools or hdparm with this >> drive may result in data loss due to a firmware bug. >> ****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ****** >> Buggy and fixed firmware report same version number! >> See the following web pages for details: >> http://www.samsung.com/global/business/hdd/faqView.do?b2b_bbs_msg_id=386 >> http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks >> >> SMART support is: Available - device has SMART capability. >> SMART support is: Enabled >> >> === START OF READ SMART DATA SECTION === >> SMART overall-health self-assessment test result: PASSED >> >> General SMART Values: >> Offline data collection status: (0x80) Offline data collection activity >> was never started. >> Auto Offline Data Collection: Enabled. >> Self-test execution status: ( 118) The previous self-test completed having >> the read element of the test failed. >> Total time to complete Offline >> data collection: (20640) seconds. >> Offline data collection >> capabilities: (0x5b) SMART execute Offline immediate. >> Auto Offline data collection on/off support. >> Suspend Offline collection upon new >> command. >> Offline surface scan supported. >> Self-test supported. >> No Conveyance Self-test supported. >> Selective Self-test supported. >> SMART capabilities: (0x0003) Saves SMART data before entering >> power-saving mode. >> Supports SMART auto save timer. 
>> Error logging capability: (0x01) Error logging supported. >> General Purpose Logging supported. >> Short self-test routine >> recommended polling time: ( 2) minutes. >> Extended self-test routine >> recommended polling time: ( 255) minutes. >> SCT capabilities: (0x003f) SCT Status supported. >> SCT Error Recovery Control supported. >> SCT Feature Control supported. >> SCT Data Table supported. >> >> SMART Attributes Data Structure revision number: 16 >> Vendor Specific SMART Attributes with Thresholds: >> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE >> UPDATED WHEN_FAILED RAW_VALUE >> 1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail >> Always - 341 >> 2 Throughput_Performance 0x0026 055 053 000 Old_age >> Always - 18989 >> 3 Spin_Up_Time 0x0023 067 044 025 Pre-fail >> Always - 10165 >> 4 Start_Stop_Count 0x0032 100 100 000 Old_age >> Always - 18 >> 5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail >> Always - 0 >> 7 Seek_Error_Rate 0x002e 252 252 051 Old_age >> Always - 0 >> 8 Seek_Time_Performance 0x0024 252 252 015 Old_age >> Offline - 0 >> 9 Power_On_Hours 0x0032 100 100 000 Old_age >> Always - 6452 >> 10 Spin_Retry_Count 0x0032 252 252 051 Old_age >> Always - 0 >> 11 Calibration_Retry_Count 0x0032 252 252 000 Old_age >> Always - 0 >> 12 Power_Cycle_Count 0x0032 100 100 000 Old_age >> Always - 20 >> 181 Program_Fail_Cnt_Total 0x0022 100 100 000 Old_age >> Always - 10121757 >> 191 G-Sense_Error_Rate 0x0022 100 100 000 Old_age >> Always - 1 >> 192 Power-Off_Retract_Count 0x0022 252 252 000 Old_age >> Always - 0 >> 194 Temperature_Celsius 0x0002 064 057 000 Old_age >> Always - 31 (Min/Max 16/43) >> 195 Hardware_ECC_Recovered 0x003a 100 100 000 Old_age >> Always - 0 >> 196 Reallocated_Event_Count 0x0032 252 252 000 Old_age >> Always - 0 >> 197 Current_Pending_Sector 0x0032 100 100 000 Old_age >> Always - 8 >> 198 Offline_Uncorrectable 0x0030 100 100 000 Old_age >> Offline - 1 >> 199 UDMA_CRC_Error_Count 0x0036 200 200 000 Old_age >> Always - 0 >> 200 
Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age >> Always - 3 >> 223 Load_Retry_Count 0x0032 252 252 000 Old_age >> Always - 0 >> 225 Load_Cycle_Count 0x0032 100 100 000 Old_age >> Always - 21 >> >> SMART Error Log Version: 1 >> No Errors Logged >> >> SMART Self-test log structure revision number 1 >> Num Test_Description Status Remaining >> LifeTime(hours) LBA_of_first_error >> # 1 Extended offline Completed: read failure 60% 6452 >> 1519520304 >> # 2 Extended offline Interrupted (host reset) 50% 6408 - >> # 3 Extended offline Completed without error 00% 6317 - >> # 4 Extended offline Completed without error 00% 6260 - >> # 5 Extended offline Completed without error 00% 6232 - >> # 6 Extended offline Completed without error 00% 6170 - >> # 7 Extended offline Completed without error 00% 6064 - >> # 8 Extended offline Completed without error 00% 6029 - >> # 9 Extended offline Completed without error 00% 5898 - >> #10 Extended offline Aborted by host 60% 5893 - >> #11 Extended offline Completed without error 00% 5728 - >> #12 Extended offline Completed without error 00% 5706 - >> #13 Extended offline Interrupted (host reset) 40% 5701 - >> #14 Extended offline Interrupted (host reset) 90% 5666 - >> #15 Extended offline Completed without error 00% 5560 - >> #16 Extended offline Completed without error 00% 5527 - >> #17 Extended offline Completed without error 00% 5392 - >> #18 Extended offline Completed without error 00% 5357 - >> #19 Extended offline Completed without error 00% 5250 - >> #20 Extended offline Completed without error 00% 4272 - >> #21 Extended offline Completed without error 00% 4017 - >> >> Note: selective self-test log revision number (0) not 1 implies that >> no selective self-test has ever been run >> SMART Selective self-test log data structure revision number 0 >> Note: revision number not 1 implies that no selective self-test has >> ever been run >> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS >> 1 0 0 Completed_read_failure [60% left] (0-65535) >> 
>> 2 0 0 Not_testing
>> 3 0 0 Not_testing
>> 4 0 0 Not_testing
>> 5 0 0 Not_testing
>> Selective self-test flags (0x0):
>> After scanning selected spans, do NOT read-scan remainder of disk.
>> If Selective self-test is pending on power-up, resume after 0 minute delay.
>>
>> :( I do have a bad HDD. RMA is already in progress, I need to take out
>> the drive and ship it to Samsung in Holland. Will print labels at work
>> on Tuesday.
>>
>> Questions:
>>
>> * How do I remove this HDD without causing damage to the array? Is
>> this the correct way?:
>> mdadm --manage /dev/md0 --fail /dev/sdh1 # fail the device
>> mdadm --manage /dev/md0 --remove /dev/sdh1 # remove the device
>> * (shut down the system gracefully)
>> * (remove the HDD)
>> * (install new HDD)
>> * (start system)
>> sfdisk -d /dev/sde | sfdisk /dev/sdh # partition the new HDD
>> mdadm --manage /dev/md0 --add /dev/sdh1 # add the partition to the array
>>
>> * After removing the HDD, should I do another scrub?
>>
>> Thanks a lot in advance!
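For reference, the sequence asked about above can be written down as a small dry-run helper. This is only a sketch under stated assumptions, not a confirmed procedure: replace_plan is a hypothetical function name, it only prints the commands for review and executes nothing, and it assumes the replacement disk comes up under the same device name with its RAID partition as partition 1.

```shell
#!/bin/sh
# Dry-run sketch: prints the replacement steps for review; it executes nothing.
# Assumption: the new disk's RAID partition is partition 1 (as with /dev/sdh1).
replace_plan() {
    array=$1      # the md array, e.g. /dev/md0
    failed=$2     # the failing member, e.g. /dev/sdh1
    template=$3   # healthy disk whose partition table is copied, e.g. /dev/sde
    newdisk=$4    # the replacement disk, e.g. /dev/sdh

    echo "mdadm --manage $array --fail $failed"
    echo "mdadm --manage $array --remove $failed"
    echo "# shut down, swap the drive, boot"
    echo "sfdisk -d $template | sfdisk $newdisk"
    echo "mdadm --manage $array --add ${newdisk}1"
    echo "cat /proc/mdstat   # watch the rebuild"
}

replace_plan /dev/md0 /dev/sdh1 /dev/sde /dev/sdh
```

Printing the plan first makes it easy to check the device names before anything touches the array; the lines can then be run by hand one at a time.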
>>
>> /Mathias
>>
>
> I think I need to hurry up:
>
> [13957.348692] ata10.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
> [13957.348704] ata10.00: failed command: READ FPDMA QUEUED
> [13957.348716] ata10.00: cmd 60/08:00:00:ab:5a/00:00:e1:00:00/40 tag 0 ncq 4096 in
> [13957.348719] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> [13957.348724] ata10.00: status: { DRDY }
> [13957.348730] ata10.00: failed command: WRITE FPDMA QUEUED
> [13957.348741] ata10.00: cmd 61/08:08:a8:e7:8d/00:00:2e:00:00/40 tag 1 ncq 4096 out
> [13957.348743] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [13957.348749] ata10.00: status: { DRDY }
> [13957.348759] ata10: hard resetting link
> [13962.835319] ata10: link is slow to respond, please be patient (ready=0)
> [13967.368679] ata10: SRST failed (errno=-16)
> [13967.368693] ata10: hard resetting link
> [13970.988699] ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [13971.002394] ata10.00: configured for UDMA/133
> [13971.002407] ata10.00: device reported invalid CHS sector 0
> [13971.002413] ata10.00: device reported invalid CHS sector 0
> [13971.002427] ata10: EH complete
> [14001.358848] ata10.00: exception Emask 0x0 SAct 0x2 SErr 0x0 action 0x6 frozen
> [14001.358862] ata10.00: failed command: READ FPDMA QUEUED
> [14001.358887] ata10.00: cmd 60/08:08:00:ab:5a/00:00:e1:00:00/40 tag 1 ncq 4096 in
> [14001.358890] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [14001.358898] ata10.00: status: { DRDY }
> [14001.358913] ata10: hard resetting link
> [14006.845324] ata10: link is slow to respond, please be patient (ready=0)
> [14011.378656] ata10: SRST failed (errno=-16)
> [14011.378669] ata10: hard resetting link
> [14016.865323] ata10: link is slow to respond, please be patient (ready=0)
> [14021.398640] ata10: SRST failed (errno=-16)
> [14021.398652] ata10: hard resetting link
> [14026.885310] ata10: link is slow to respond, please be patient (ready=0)
>
[14029.925349] ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > [14029.939048] ata10.00: configured for UDMA/133 > [14029.939061] ata10.00: device reported invalid CHS sector 0 > [14029.939078] ata10: EH complete > [14060.345358] ata10.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen > [14060.345371] ata10.00: failed command: READ FPDMA QUEUED > [14060.345384] ata10.00: cmd 60/08:00:00:ab:5a/00:00:e1:00:00/40 tag 0 > ncq 4096 in > [14060.345387] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask > 0x4 (timeout) > [14060.345394] ata10.00: status: { DRDY } > [14060.345407] ata10: hard resetting link > [14065.831985] ata10: link is slow to respond, please be patient (ready=0) > [14070.365333] ata10: SRST failed (errno=-16) > [14070.365345] ata10: hard resetting link > [14074.625347] ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > [14074.639043] ata10.00: configured for UDMA/133 > [14074.639056] ata10.00: device reported invalid CHS sector 0 > [14074.639088] ata10: EH complete > [14105.358687] ata10.00: NCQ disabled due to excessive errors > [14105.358700] ata10.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen > [14105.358712] ata10.00: failed command: READ FPDMA QUEUED > [14105.358729] ata10.00: cmd 60/08:00:00:ab:5a/00:00:e1:00:00/40 tag 0 > ncq 4096 in > [14105.358732] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask > 0x4 (timeout) > [14105.358741] ata10.00: status: { DRDY } > [14105.358754] ata10: hard resetting link > [14110.845314] ata10: link is slow to respond, please be patient (ready=0) > [14115.378674] ata10: SRST failed (errno=-16) > [14115.378689] ata10: hard resetting link > [14119.372023] ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > [14119.385704] ata10.00: configured for UDMA/133 > [14119.385716] ata10.00: device reported invalid CHS sector 0 > [14119.385743] ata10: EH complete > [14121.527814] ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 > [14121.527824] ata10.00: edma_err_cause=00000084 
pp_flags=00000001, > dev error, EDMA self-disable > [14121.527832] ata10.00: failed command: READ DMA EXT > [14121.527844] ata10.00: cmd 25/00:08:00:ab:5a/00:00:e1:00:00/e0 tag 0 > dma 4096 in > [14121.527846] res 51/89:08:00:ab:5a/89:00:e1:00:00/e0 Emask > 0x10 (ATA bus error) > [14121.527852] ata10.00: status: { DRDY ERR } > [14121.527857] ata10.00: error: { ICRC } > [14121.527867] ata10: hard resetting link > [14127.011973] ata10: link is slow to respond, please be patient (ready=0) > [14131.545295] ata10: SRST failed (errno=-16) > [14131.545307] ata10: hard resetting link > [14137.031984] ata10: link is slow to respond, please be patient (ready=0) > [14141.565306] ata10: SRST failed (errno=-16) > [14141.565317] ata10: hard resetting link > [14147.051993] ata10: link is slow to respond, please be patient (ready=0) > [14161.032035] INFO: task jbd2/dm-0-8:613 blocked for more than 120 seconds. > [14161.032044] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [14161.032050] jbd2/dm-0-8 D ffffffff81823020 0 613 2 0x00000000 > [14161.032060] ffff8800c91e11a0 0000000000000046 0000000000000000 > ffff8800cf0697c0 > [14161.032070] ffff8800cf069780 ffff8800cf069780 ffff8800c93bdfd8 > ffff8800c93bdfd8 > [14161.032079] ffff8800c91e13d8 0000000000004000 ffff880094ed19c0 > ffff8800c97e0688 > [14161.032087] Call Trace: > [14161.032106] [<ffffffff81204c67>] ? generic_make_request+0x2f7/0x570 > [14161.032116] [<ffffffff81133200>] ? __wait_on_buffer+0x30/0x30 > [14161.032124] [<ffffffff814262e7>] ? io_schedule+0x57/0x80 > [14161.032131] [<ffffffff8113320a>] ? sleep_on_buffer+0xa/0x20 > [14161.032137] [<ffffffff81426a2f>] ? __wait_on_bit+0x4f/0x80 > [14161.032143] [<ffffffff81133200>] ? __wait_on_buffer+0x30/0x30 > [14161.032150] [<ffffffff81426add>] ? out_of_line_wait_on_bit+0x7d/0xa0 > [14161.032159] [<ffffffff81059300>] ? autoremove_wake_function+0x30/0x30 > [14161.032168] [<ffffffff811b680e>] ? 
> jbd2_journal_commit_transaction+0x155e/0x16f0
> [14161.032176] [<ffffffff810592d0>] ? abort_exclusive_wait+0xb0/0xb0
> [14161.032183] [<ffffffff8142968e>] ? apic_timer_interrupt+0xe/0x20
> [14161.032191] [<ffffffff811baddd>] ? kjournald2+0xad/0x210
> [14161.032198] [<ffffffff810592d0>] ? abort_exclusive_wait+0xb0/0xb0
> [14161.032205] [<ffffffff811bad30>] ? commit_timeout+0x10/0x10
> [14161.032212] [<ffffffff81058a0f>] ? kthread+0x7f/0x90
> [14161.032219] [<ffffffff81429a54>] ? kernel_thread_helper+0x4/0x10
> [14161.032226] [<ffffffff81058990>] ? kthread_worker_fn+0x180/0x180
> [14161.032233] [<ffffffff81429a50>] ? gs_change+0xb/0xb
> [14161.032261] INFO: task squid:1659 blocked for more than 120 seconds.
> [14161.032264] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [14161.032269] squid D ffffffff81823020 0 1659 1657 0x00000000
> [14161.032277] ffff8800c9179d60 0000000000000082 ffff8800c97b0688 ffffffff812bf074
> [14161.032285] ffff880012b818c0 ffffffff81823020 ffff8800c22c5fd8 ffff8800c22c5fd8
> [14161.032293] ffff8800c9179f98 0000000000004000 ffff8800c22c5fd8 0000000000000000
> [14161.032301] Call Trace:
>
>
> :-/
>
> Currently shutting down all daemons that access the filesystem on the
> array, in attempt to umount the fs.
>

Sorry for spamming like this, it's happening in realtime. While I was
shutting down daemons it looks like MD took out the failing HDD
itself, see:

[14401.032600] INFO: task flush-253:0:25632 blocked for more than 120 seconds.
[14401.032604] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[14401.032608] flush-253:0 D ffffffff81823020 0 25632 2 0x00000000
[14401.032615] ffff88003c02bac0 0000000000000046 ffffffff81425a7f ffff8800cf06f5c0
[14401.032623] ffff8800cf06f580 ffffffff81823020 ffff88001666ffd8 ffff88001666ffd8
[14401.032632] ffff88003c02bcf8 0000000000004000 ffff88001666ffd8 ffff88003c02bd00
[14401.032639] Call Trace:
[14401.032645] [<ffffffff81425a7f>] ?
schedule+0x49f/0xcb0 [14401.032653] [<ffffffff8105f465>] ? sched_clock_local+0x15/0x80 [14401.032662] [<ffffffff81343287>] ? get_active_stripe+0x307/0x6f0 [14401.032669] [<ffffffff81032ab0>] ? try_to_wake_up+0x210/0x210 [14401.032676] [<ffffffff81346b38>] ? make_request+0x198/0x660 [14401.032683] [<ffffffff8135b4fa>] ? __map_bio+0x4a/0x1d0 [14401.032690] [<ffffffff810592d0>] ? abort_exclusive_wait+0xb0/0xb0 [14401.032697] [<ffffffff8134d563>] ? md_make_request+0x103/0x250 [14401.032704] [<ffffffff81204c67>] ? generic_make_request+0x2f7/0x570 [14401.032710] [<ffffffff810f7399>] ? kmem_cache_alloc+0x169/0x180 [14401.032718] [<ffffffff8135bff4>] ? dm_get_live_table+0x44/0x60 [14401.032724] [<ffffffff81360225>] ? linear_merge+0x45/0x50 [14401.032731] [<ffffffff81204f4d>] ? submit_bio+0x6d/0x100 [14401.032738] [<ffffffff8118296c>] ? ext4_io_submit+0x1c/0x50 [14401.032744] [<ffffffff81182ac1>] ? ext4_bio_write_page+0x121/0x370 [14401.032751] [<ffffffff8117c137>] ? mpage_da_submit_io+0x347/0x450 [14401.032759] [<ffffffff81180eae>] ? mpage_da_map_and_submit+0x1ce/0x420 [14401.032766] [<ffffffff81181920>] ? ext4_da_writepages+0x340/0x620 [14401.032774] [<ffffffff812071f7>] ? blk_flush_plug_list+0xa7/0x250 [14401.032782] [<ffffffff8112c41e>] ? writeback_single_inode+0x10e/0x270 [14401.032789] [<ffffffff8112c801>] ? writeback_sb_inodes+0xf1/0x1b0 [14401.032796] [<ffffffff814297ee>] ? reschedule_interrupt+0xe/0x20 [14401.032803] [<ffffffff8112d36b>] ? writeback_inodes_wb+0x7b/0x150 [14401.032810] [<ffffffff8112d8d3>] ? wb_writeback+0x493/0x4f0 [14401.032818] [<ffffffff8111fec2>] ? get_nr_inodes+0x42/0x60 [14401.032825] [<ffffffff8112d9c7>] ? wb_check_old_data_flush+0x97/0xa0 [14401.032832] [<ffffffff8112db3f>] ? wb_do_writeback+0x16f/0x210 [14401.032839] [<ffffffff81046dd0>] ? init_timer_deferrable_key+0x10/0x10 [14401.032846] [<ffffffff8112dc5b>] ? bdi_writeback_thread+0x7b/0x310 [14401.032852] [<ffffffff8102d149>] ? 
__wake_up_common+0x49/0x80 [14401.032860] [<ffffffff8112dbe0>] ? wb_do_writeback+0x210/0x210 [14401.032866] [<ffffffff81058a0f>] ? kthread+0x7f/0x90 [14401.032873] [<ffffffff81429a54>] ? kernel_thread_helper+0x4/0x10 [14401.032880] [<ffffffff81058990>] ? kthread_worker_fn+0x180/0x180 [14401.032887] [<ffffffff81429a50>] ? gs_change+0xb/0xb [14404.915260] ata10: link is slow to respond, please be patient (ready=0) [14409.448594] ata10: SRST failed (errno=-16) [14409.448604] ata10: hard resetting link [14414.935260] ata10: link is slow to respond, please be patient (ready=0) [14431.895309] ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 310) [14431.909004] ata10.00: configured for UDMA/33 [14431.909026] ata10: EH complete [14434.040581] ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 [14434.040589] ata10.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable [14434.040597] ata10.00: failed command: READ DMA EXT [14434.040609] ata10.00: cmd 25/00:08:10:e0:59/00:00:b2:00:00/e0 tag 0 dma 4096 in [14434.040612] res 51/89:08:10:e0:59/89:00:b2:00:00/e0 Emask 0x10 (ATA bus error) [14434.040617] ata10.00: status: { DRDY ERR } [14434.040622] ata10.00: error: { ICRC } [14434.040631] ata10: hard resetting link [14439.525256] ata10: link is slow to respond, please be patient (ready=0) [14444.058589] ata10: SRST failed (errno=-16) [14444.058600] ata10: hard resetting link [14449.545254] ata10: link is slow to respond, please be patient (ready=0) [14454.078594] ata10: SRST failed (errno=-16) [14454.078604] ata10: hard resetting link [14459.565256] ata10: link is slow to respond, please be patient (ready=0) [14476.525289] ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 310) [14476.579007] ata10.00: configured for UDMA/33 [14476.579038] sd 9:0:0:0: [sdh] Device not ready [14476.579043] sd 9:0:0:0: [sdh] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [14476.579049] sd 9:0:0:0: [sdh] Sense Key : Not Ready [current] [descriptor] [14476.579057] 
Descriptor sense data with sense descriptors (in hex):
[14476.579061] 72 02 04 00 00 00 00 0c 00 0a 80 00 00 00 00 00
[14476.579076] b2 59 e0 10
[14476.579083] sd 9:0:0:0: [sdh] Add. Sense: Logical unit not ready, cause not reportable
[14476.579091] sd 9:0:0:0: [sdh] CDB: Read(10): 28 00 b2 59 e0 10 00 00 08 00
[14476.579106] end_request: I/O error, dev sdh, sector 2992234512
[14476.579147] ata10: EH complete
[14484.352060] ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[14484.352070] ata10.00: failed command: SMART
[14484.352082] ata10.00: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0
[14484.352084] res 40/00:08:10:e0:59/89:00:b2:00:00/e0 Emask 0x4 (timeout)
[14484.352090] ata10.00: status: { DRDY }
[14484.352099] ata10: hard resetting link
[14489.838590] ata10: link is slow to respond, please be patient (ready=0)
[14494.371940] ata10: SRST failed (errno=-16)
[14494.371951] ata10: hard resetting link
[14499.858587] ata10: link is slow to respond, please be patient (ready=0)
[14504.391923] ata10: SRST failed (errno=-16)
[14504.391933] ata10: hard resetting link
[14509.878583] ata10: link is slow to respond, please be patient (ready=0)
[14536.995640] nfsd: last server has exited, flushing export cache
[14539.425250] ata10: SRST failed (errno=-16)
[14539.425263] ata10: hard resetting link
[14544.431917] ata10: SRST failed (errno=-16)
[14544.431926] ata10: reset failed, giving up
[14544.431932] ata10.00: disabled
[14544.431973] ata10: EH complete
[14544.432042] sd 9:0:0:0: [sdh] Unhandled error code
[14544.432062] sd 9:0:0:0: [sdh] Unhandled error code
[14544.432080] sd 9:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.432111] sd 9:0:0:0: [sdh]
[14544.432126] sd 9:0:0:0: [sdh] CDB: Read(10)Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.432163] sd 9:0:0:0: [sdh] CDB: Write(10): 2a 00 e1 5a ab 00 00 00 08
[14544.432216] sd 9:0:0:0: [sdh] Unhandled error code
[14544.432232] 00
[14544.432240] sd 9:0:0:0: [sdh]
[14544.432253] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.432272] end_request: I/O error, dev sdh, sector 3780815616
[14544.432290] sd 9:0:0:0: [sdh] CDB:
[14544.432305] md/raid:md0: Disk failure on sdh1, disabling device.
[14544.432321] md/raid:md0: Operation continuing on 6 devices.
[14544.432336] Write(10): 2a 00 e1 5a ef 78 00 00 80 00
[14544.432389] end_request: I/O error, dev sdh, sector 3780833144
[14544.432412] : 28 00 b2 33 44 80 00 00 08 00
[14544.432433] end_request: I/O error, dev sdh, sector 2989704320
[14544.432453] sd 9:0:0:0: [sdh] Unhandled error code
[14544.432471] sd 9:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.432498] sd 9:0:0:0: [sdh] CDB: Write(10): 2a 00 e1 5a ef f8 00 00 80 00
[14544.432553] end_request: I/O error, dev sdh, sector 3780833272
[14544.432618] sd 9:0:0:0: [sdh] Unhandled error code
[14544.432639] sd 9:0:0:0: [sdh] Unhandled error code
[14544.432659] sd 9:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.432687] sd 9:0:0:0: [sdh] CDB: Write(10): 2a 00 b2 59 e0 10 00 00 08 00
[14544.432749] end_request: I/O error, dev sdh, sector 2992234512
[14544.432774] sd 9:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.432785] sd 9:0:0:0: [sdh] CDB: Write(10): 2a 00
[14544.432800] sd 9:0:0:0: [sdh] Unhandled error code
[14544.432820] sd 9:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.432849] sd 9:0:0:0: [sdh] CDB: Write(10): 2a 00 e1 5a eb 00 00 00 80 00
[14544.432910] end_request: I/O error, dev sdh, sector 3780832000
[14544.432933] e1 5a f0 78 00 00 80 00
[14544.432949] end_request: I/O error, dev sdh, sector 3780833400
[14544.432965] sd 9:0:0:0: [sdh] Unhandled error code
[14544.432972] sd 9:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.432982] sd 9:0:0:0: [sdh] CDB: Write(10): 2a 00 e1 5a ef 00 00 00 78 00
[14544.433002] end_request: I/O error, dev sdh, sector 3780833024
[14544.433012] sd 9:0:0:0: [sdh] Unhandled error code
[14544.433033] sd 9:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.433062] sd 9:0:0:0: [sdh] CDB: Write(10): 2a 00 e1 5a f0 f8 00 00 80 00
[14544.433121] end_request: I/O error, dev sdh, sector 3780833528
[14544.433205] sd 9:0:0:0: [sdh] Unhandled error code
[14544.433227] sd 9:0:0:0: [sdh] Unhandled error code
[14544.433247] sd 9:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.433276] sd 9:0:0:0: [sdh] CDB: Write(10): 2a 00 e1 5a f1 78 00 00 80 00
[14544.433333] end_request: I/O error, dev sdh, sector 3780833656
[14544.433357] sd 9:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.433366] sd 9:0:0:0: [sdh] CDB: Write(10): 2a 00 2e 8d e7 b0 00 00 28 00
[14544.433387] end_request: I/O error, dev sdh, sector 781051824
[14544.433399] sd 9:0:0:0: [sdh] Unhandled error code
[14544.433408] sd 9:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.433419] sd 9:0:0:0: [sdh] CDB: Write(10): 2a 00 e1 5a f1 f8 00 00 80 00
[14544.433440] end_request: I/O error, dev sdh, sector 3780833784
[14544.433487] sd 9:0:0:0: [sdh] Unhandled error code
[14544.433493] sd 9:0:0:0: [sdh] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[14544.433502] sd 9:0:0:0: [sdh] CDB: Write(10): 2a 00 e1 5a f2 78 00 00 08 00
[14544.433523] end_request: I/O error, dev sdh, sector 3780833912
[14544.488761] RAID conf printout:
[14544.488772]  --- level:6 rd:7 wd:6
[14544.488779]  disk 0, o:1, dev:sdg1
[14544.488783]  disk 1, o:1, dev:sdb1
[14544.488788]  disk 2, o:1, dev:sdd1
[14544.488793]  disk 3, o:1, dev:sdc1
[14544.488797]  disk 4, o:1, dev:sdf1
[14544.488801]  disk 5, o:0, dev:sdh1
[14544.488805]  disk 6, o:1, dev:sde1
[14544.515279] RAID conf printout:
[14544.515289]  --- level:6 rd:7 wd:6
[14544.515295]  disk 0, o:1, dev:sdg1
[14544.515299]  disk 1, o:1, dev:sdb1
[14544.515304]  disk 2, o:1, dev:sdd1
[14544.515308]  disk 3, o:1, dev:sdc1
[14544.515313]  disk 4, o:1, dev:sdf1
[14544.515317]  disk 6, o:1, dev:sde1
[14570.639011] nvidia 0000:00:03.5: PCI INT B disabled
[14570.639441] nvidia 0000:03:00.0: PCI INT A disabled
[14591.505699] HDA Intel 0000:00:08.0: PCI INT A disabled

~ $ dmesg > /srv/http/dmesg-failing-harddrive.log
~ $ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdh1[6](F) sdg1[0] sdf1[5] sde1[7] sdd1[4] sdb1[1] sdc1[3]
      9751756800 blocks super 1.2 level 6, 64k chunk, algorithm 2 [7/6] [UUUUU_U]

unused devices: <none>
~ $
$ mdadm -D /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Tue Oct 19 08:58:41 2010
     Raid Level : raid6
     Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
  Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
   Raid Devices : 7
  Total Devices : 7
    Persistence : Superblock is persistent

    Update Time : Sun Jul 31 19:07:25 2011
          State : clean, degraded
 Active Devices : 6
Working Devices : 6
 Failed Devices : 1
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : ion:0 (local to host ion)
           UUID : e6595c64:b3ae90b3:f01133ac:3f402d20
         Events : 6158774

    Number   Major   Minor   RaidDevice State
       0       8       97        0      active sync   /dev/sdg1
       1       8       17        1      active sync   /dev/sdb1
       4       8       49        2      active sync   /dev/sdd1
       3       8       33        3      active sync   /dev/sdc1
       5       8       81        4      active sync   /dev/sdf1
       5       0        0        5      removed
       7       8       65        6      active sync   /dev/sde1

       6       8      113        -      faulty spare   /dev/sdh1

Shutting down the LV:

$ vgchange -an lvstorage
  /dev/sdh: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdh: read failed after 0 of 4096 at 2000398843904: Input/output error
  /dev/sdh: read failed after 0 of 4096 at 2000398925824: Input/output error
  /dev/sdh: read failed after 0 of 4096 at 4096: Input/output error
  /dev/sdh1: read failed after 0 of 4096 at 2000397795328: Input/output error
  /dev/sdh1: read failed after 0 of 4096 at 2000397877248: Input/output error
  /dev/sdh1: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdh1: read failed after 0 of 4096 at 4096: Input/output error
  0 logical volume(s) in volume group "lvstorage" now active

Removing the bad HDD:

$ mdadm --manage /dev/md0 --remove /dev/sdh1
mdadm: hot removed /dev/sdh1 from /dev/md0

$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdg1[0] sdf1[5] sde1[7] sdd1[4] sdb1[1] sdc1[3]
      9751756800 blocks super 1.2 level 6, 64k chunk, algorithm 2 [7/6] [UUUUU_U]

unused devices: <none>

$ mdadm -D /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Tue Oct 19 08:58:41 2010
     Raid Level : raid6
     Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
  Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
   Raid Devices : 7
  Total Devices : 6
    Persistence : Superblock is persistent

    Update Time : Sun Jul 31 19:13:02 2011
          State : clean, degraded
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : ion:0 (local to host ion)
           UUID : e6595c64:b3ae90b3:f01133ac:3f402d20
         Events : 6158779

    Number   Major   Minor   RaidDevice State
       0       8       97        0      active sync   /dev/sdg1
       1       8       17        1      active sync   /dev/sdb1
       4       8       49        2      active sync   /dev/sdd1
       3       8       33        3      active sync   /dev/sdc1
       5       8       81        4      active sync   /dev/sdf1
       5       0        0        5      removed
       7       8       65        6      active sync   /dev/sde1

I'll shut down the system now and remove the HDD. I suppose I've just
one question to ask then; should I rescrub the array when it's up with
1 HDD removed?

Thanks again,
Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html