Hello Phil and the list,

On Sunday 8 November 2015, 22:35:13, Phil Turmel wrote:
> On 11/08/2015 09:49 PM, Guillaume Paumier wrote:
> >
> > If I understand the documentation [1] correctly, since the event count for
> > sdj is very close to the event count of sd[b,c,d,g,h,i], I should be able
> > to re-assemble the array with these 7 disks using --force, leaving sde
> > and sdf aside. Once the array is assembled, I should be able to re-add
> > sde and sdf, and they will be re-sync'd.
>
> Yes, that is the correct response.
>
> Your situation is common. Please see the thread this weekend started by
> Franscisco Parada.

Thank you for confirming, Phil, and for the additional pointer.

I've re-assembled the array with --force, which cleaned sdj, and then I was
able to re-add the two other disks. The array started rebuilding, and recovery
was past 10% when the array failed again.

It seems there was an "unrecoverable read error" on sdj. I'm now back with an
array where 2 of the disks are marked as spare (sde and sdf, because their
rebuild didn't complete), and sdj is faulty with an event count mismatch of 4,
like before:

/dev/sdb1: Events : 198704
/dev/sdc1: Events : 198704
/dev/sdd1: Events : 198704
/dev/sde1: Events : 198704
/dev/sdf1: Events : 198704
/dev/sdg1: Events : 198704
/dev/sdh1: Events : 198704
/dev/sdi1: Events : 198704
/dev/sdj1: Events : 198700

Below is the output of dmesg with more details on the read error.

Is there any way I can move past this? This error is preventing me from
rebuilding the array, and I'm assuming it would also prevent me from copying
the data off the array without rebuilding, so I'm not sure how to proceed.

Any guidance would be much appreciated.

[88233.712961] md: recovery of RAID array md0
[88233.712965] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[88233.712967] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[88233.712978] md: using 128k window, over a total of 3907016448k.
[88953.752335] ata9.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[88953.752345] ata9.01: BMDMA stat 0x64
[88953.752353] ata9.01: failed command: READ DMA EXT
[88953.752368] ata9.01: cmd 25/00:00:00:fc:e8/00:02:27:00:00/f0 tag 0 dma 262144 in
         res 51/40:00:f8:fd:e8/40:00:27:00:00/10 Emask 0x9 (media error)
[88953.752375] ata9.01: status: { DRDY ERR }
[88953.752380] ata9.01: error: { UNC }
[88953.793877] ata9.00: configured for UDMA/33
[88953.799795] ata9.01: configured for UDMA/33
[88953.799855] sd 8:0:1:0: [sdj] Unhandled sense code
[88953.799858] sd 8:0:1:0: [sdj]
[88953.799860] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[88953.799862] sd 8:0:1:0: [sdj]
[88953.799864] Sense Key : Medium Error [current] [descriptor]
[88953.799867] Descriptor sense data with sense descriptors (in hex):
[88953.799868] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[88953.799875] 27 e8 fd f8
[88953.799879] sd 8:0:1:0: [sdj]
[88953.799882] Add. Sense: Unrecovered read error - auto reallocate failed
[88953.799884] sd 8:0:1:0: [sdj] CDB:
[88953.799885] Read(16): 88 00 00 00 00 00 27 e8 fc 00 00 00 02 00 00 00
[88953.799894] end_request: I/O error, dev sdj, sector 669580792
[88953.799898] md/raid:md0: read error not correctable (sector 669578744 on sdj1).
[88953.799924] ata9: EH complete
[89333.138473] ata9.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[89333.138478] ata9.01: BMDMA stat 0x64
[89333.138482] ata9.01: failed command: READ DMA EXT
[89333.138488] ata9.01: cmd 25/00:00:58:6e:3b/00:02:35:00:00/f0 tag 0 dma 262144 in
         res 51/40:00:c8:6f:3b/40:00:35:00:00/10 Emask 0x9 (media error)
[89333.138491] ata9.01: status: { DRDY ERR }
[89333.138493] ata9.01: error: { UNC }
[89333.147985] ata9.00: configured for UDMA/33
[89333.153966] ata9.01: configured for UDMA/33
[89333.154022] sd 8:0:1:0: [sdj] Unhandled sense code
[89333.154025] sd 8:0:1:0: [sdj]
[89333.154027] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[89333.154029] sd 8:0:1:0: [sdj]
[89333.154031] Sense Key : Medium Error [current] [descriptor]
[89333.154034] Descriptor sense data with sense descriptors (in hex):
[89333.154035] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[89333.154042] 35 3b 6f c8
[89333.154046] sd 8:0:1:0: [sdj]
[89333.154048] Add. Sense: Unrecovered read error - auto reallocate failed
[89333.154050] sd 8:0:1:0: [sdj] CDB:
[89333.154052] Read(16): 88 00 00 00 00 00 35 3b 6e 58 00 00 02 00 00 00
[89333.154061] end_request: I/O error, dev sdj, sector 893087688
[89333.154064] md/raid:md0: read error not correctable (sector 893085640 on sdj1).
[89333.154067] md/raid:md0: read error not correctable (sector 893085648 on sdj1).
[89333.154069] md/raid:md0: read error not correctable (sector 893085656 on sdj1).
[89333.154071] md/raid:md0: read error not correctable (sector 893085664 on sdj1).
[89333.154073] md/raid:md0: read error not correctable (sector 893085672 on sdj1).
[89333.154075] md/raid:md0: read error not correctable (sector 893085680 on sdj1).
[89333.154077] md/raid:md0: read error not correctable (sector 893085688 on sdj1).
[89333.154079] md/raid:md0: read error not correctable (sector 893085696 on sdj1).
[89333.154081] md/raid:md0: read error not correctable (sector 893085704 on sdj1).
[89333.154083] md/raid:md0: read error not correctable (sector 893085712 on sdj1).
[89333.154111] ata9: EH complete
[89338.097012] ata9.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[89338.097016] ata9.01: BMDMA stat 0x64
[89338.097019] ata9.01: failed command: READ DMA EXT
[89338.097023] ata9.01: cmd 25/00:00:58:70:3b/00:02:35:00:00/f0 tag 0 dma 262144 in
         res 51/40:00:60:70:3b/40:00:35:00:00/10 Emask 0x9 (media error)
[89338.097025] ata9.01: status: { DRDY ERR }
[89338.097026] ata9.01: error: { UNC }
[89338.125468] ata9.00: configured for UDMA/33
[89338.131458] ata9.01: configured for UDMA/33
[89338.131489] sd 8:0:1:0: [sdj] Unhandled sense code
[89338.131491] sd 8:0:1:0: [sdj]
[89338.131492] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[89338.131493] sd 8:0:1:0: [sdj]
[89338.131494] Sense Key : Medium Error [current] [descriptor]
[89338.131496] Descriptor sense data with sense descriptors (in hex):
[89338.131497] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[89338.131502] 35 3b 70 60
[89338.131504] sd 8:0:1:0: [sdj]
[89338.131506] Add. Sense: Unrecovered read error - auto reallocate failed
[89338.131507] sd 8:0:1:0: [sdj] CDB:
[89338.131508] Read(16): 88 00 00 00 00 00 35 3b 70 58 00 00 02 00 00 00
[89338.131513] end_request: I/O error, dev sdj, sector 893087840
[89338.131556] ata9: EH complete
[89342.103300] ata9.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[89342.103310] ata9.01: BMDMA stat 0x64
[89342.103319] ata9.01: failed command: READ DMA EXT
[89342.103333] ata9.01: cmd 25/00:00:58:72:3b/00:02:35:00:00/f0 tag 0 dma 262144 in
         res 51/40:00:58:72:3b/40:00:35:00:00/10 Emask 0x9 (media error)
[89342.103340] ata9.01: status: { DRDY ERR }
[89342.103344] ata9.01: error: { UNC }
[89342.224995] ata9.00: configured for UDMA/33
[89342.230983] ata9.01: configured for UDMA/33
[89342.231022] sd 8:0:1:0: [sdj] Unhandled sense code
[89342.231025] sd 8:0:1:0: [sdj]
[89342.231027] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[89342.231029] sd 8:0:1:0: [sdj]
[89342.231031] Sense Key : Medium Error [current] [descriptor]
[89342.231034] Descriptor sense data with sense descriptors (in hex):
[89342.231035] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[89342.231042] 35 3b 72 58
[89342.231046] sd 8:0:1:0: [sdj]
[89342.231049] Add. Sense: Unrecovered read error - auto reallocate failed
[89342.231051] sd 8:0:1:0: [sdj] CDB:
[89342.231052] Read(16): 88 00 00 00 00 00 35 3b 72 58 00 00 02 00 00 00
[89342.231061] end_request: I/O error, dev sdj, sector 893088344
[89342.231065] raid5_end_read_request: 71 callbacks suppressed
[89342.231067] md/raid:md0: read error not correctable (sector 893086296 on sdj1).
[89342.231070] md/raid:md0: read error not correctable (sector 893086304 on sdj1).
[89342.231072] md/raid:md0: read error not correctable (sector 893086312 on sdj1).
[89342.231074] md/raid:md0: read error not correctable (sector 893086320 on sdj1).
[89342.231076] md/raid:md0: read error not correctable (sector 893086328 on sdj1).
[89342.231078] md/raid:md0: read error not correctable (sector 893086336 on sdj1).
[89342.231080] md/raid:md0: read error not correctable (sector 893086344 on sdj1).
[89342.231081] md/raid:md0: read error not correctable (sector 893086352 on sdj1).
[89342.231083] md/raid:md0: read error not correctable (sector 893086360 on sdj1).
[89342.231085] md/raid:md0: read error not correctable (sector 893086368 on sdj1).
[89342.231149] ata9: EH complete
[89346.169717] ata9.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[89346.169727] ata9.01: BMDMA stat 0x64
[89346.169736] ata9.01: failed command: READ DMA EXT
[89346.169750] ata9.01: cmd 25/00:00:58:74:3b/00:02:35:00:00/f0 tag 0 dma 262144 in
         res 51/40:00:58:74:3b/40:00:35:00:00/10 Emask 0x9 (media error)
[89346.169758] ata9.01: status: { DRDY ERR }
[89346.169763] ata9.01: error: { UNC }
[89346.198239] ata9.00: configured for UDMA/33
[89346.204166] ata9.01: configured for UDMA/33
[89346.204232] sd 8:0:1:0: [sdj] Unhandled sense code
[89346.204239] sd 8:0:1:0: [sdj]
[89346.204243] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[89346.204248] sd 8:0:1:0: [sdj]
[89346.204251] Sense Key : Medium Error [current] [descriptor]
[89346.204258] Descriptor sense data with sense descriptors (in hex):
[89346.204261] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[89346.204278] 35 3b 74 58
[89346.204286] sd 8:0:1:0: [sdj]
[89346.204292] Add. Sense: Unrecovered read error - auto reallocate failed
[89346.204296] sd 8:0:1:0: [sdj] CDB:
[89346.204299] Read(16): 88 00 00 00 00 00 35 3b 74 58 00 00 02 00 00 00
[89346.204319] end_request: I/O error, dev sdj, sector 893088856
[89346.204419] ata9: EH complete
[89353.949976] ata9.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[89353.949986] ata9.01: BMDMA stat 0x64
[89353.949994] ata9.01: failed command: READ DMA EXT
[89353.950008] ata9.01: cmd 25/00:90:c8:6f:3b/00:00:35:00:00/f0 tag 0 dma 73728 in
         res 51/40:00:e0:6f:3b/40:00:35:00:00/10 Emask 0x9 (media error)
[89353.950016] ata9.01: status: { DRDY ERR }
[89353.950021] ata9.01: error: { UNC }
[89353.994545] ata9.00: configured for UDMA/33
[89354.000539] ata9.01: configured for UDMA/33
[89354.000597] sd 8:0:1:0: [sdj] Unhandled sense code
[89354.000603] sd 8:0:1:0: [sdj]
[89354.000608] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[89354.000612] sd 8:0:1:0: [sdj]
[89354.000616] Sense Key : Medium Error [current] [descriptor]
[89354.000623] Descriptor sense data with sense descriptors (in hex):
[89354.000626] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[89354.000643] 35 3b 6f e0
[89354.000651] sd 8:0:1:0: [sdj]
[89354.000657] Add. Sense: Unrecovered read error - auto reallocate failed
[89354.000661] sd 8:0:1:0: [sdj] CDB:
[89354.000664] Read(16): 88 00 00 00 00 00 35 3b 6f c8 00 00 00 90 00 00
[89354.000684] end_request: I/O error, dev sdj, sector 893087712
[89354.000692] raid5_end_read_request: 118 callbacks suppressed
[89354.000697] md/raid:md0: read error not correctable (sector 893085664 on sdj1).
[89354.000706] md/raid:md0: Disk failure on sdj1, disabling device.
md/raid:md0: Operation continuing on 6 devices.
[89354.000732] md/raid:md0: read error not correctable (sector 893085672 on sdj1).
[89354.000737] md/raid:md0: read error not correctable (sector 893085680 on sdj1).
[89354.000742] md/raid:md0: read error not correctable (sector 893085688 on sdj1).
[89354.000747] md/raid:md0: read error not correctable (sector 893085696 on sdj1).
[89354.000751] md/raid:md0: read error not correctable (sector 893085704 on sdj1).
[89354.000756] md/raid:md0: read error not correctable (sector 893085712 on sdj1).
[89354.000760] md/raid:md0: read error not correctable (sector 893085720 on sdj1).
[89354.000765] md/raid:md0: read error not correctable (sector 893085728 on sdj1).
[89354.000769] md/raid:md0: read error not correctable (sector 893085736 on sdj1).
[89354.000903] ata9: EH complete
[89354.109105] md: md0: recovery interrupted.
[89354.175670] RAID conf printout:
[89354.175675] --- level:6 rd:9 wd:6
[89354.175677] disk 0, o:1, dev:sdc1
[89354.175679] disk 1, o:1, dev:sdd1
[89354.175680] disk 2, o:1, dev:sdg1
[89354.175681] disk 3, o:1, dev:sdh1
[89354.175682] disk 4, o:1, dev:sdi1
[89354.175683] disk 5, o:0, dev:sdj1
[89354.175684] disk 6, o:1, dev:sdf1
[89354.175685] disk 7, o:1, dev:sde1
[89354.175686] disk 8, o:1, dev:sdb1
[89354.177220] RAID conf printout:
[89354.177221] --- level:6 rd:9 wd:6
[89354.177222] disk 0, o:1, dev:sdc1
[89354.177223] disk 1, o:1, dev:sdd1
[89354.177224] disk 2, o:1, dev:sdg1
[89354.177225] disk 3, o:1, dev:sdh1
[89354.177226] disk 4, o:1, dev:sdi1
[89354.177227] disk 5, o:0, dev:sdj1
[89354.177227] disk 7, o:1, dev:sde1
[89354.177228] disk 8, o:1, dev:sdb1
[89354.177233] RAID conf printout:
[89354.177234] --- level:6 rd:9 wd:6
[89354.177234] disk 0, o:1, dev:sdc1
[89354.177235] disk 1, o:1, dev:sdd1
[89354.177236] disk 2, o:1, dev:sdg1
[89354.177237] disk 3, o:1, dev:sdh1
[89354.177238] disk 4, o:1, dev:sdi1
[89354.177239] disk 5, o:0, dev:sdj1
[89354.177240] disk 7, o:1, dev:sde1
[89354.177241] disk 8, o:1, dev:sdb1
[89354.179575] RAID conf printout:
[89354.179576] --- level:6 rd:9 wd:6
[89354.179577] disk 0, o:1, dev:sdc1
[89354.179578] disk 1, o:1, dev:sdd1
[89354.179579] disk 2, o:1, dev:sdg1
[89354.179580] disk 3, o:1, dev:sdh1
[89354.179581] disk 4, o:1, dev:sdi1
[89354.179582] disk 5, o:0, dev:sdj1
[89354.179583] disk 8, o:1, dev:sdb1
[89354.179585] RAID conf printout:
[89354.179586] --- level:6 rd:9 wd:6
[89354.179587] disk 0, o:1, dev:sdc1
[89354.179588] disk 1, o:1, dev:sdd1
[89354.179589] disk 2, o:1, dev:sdg1
[89354.179589] disk 3, o:1, dev:sdh1
[89354.179590] disk 4, o:1, dev:sdi1
[89354.179591] disk 5, o:0, dev:sdj1
[89354.179592] disk 8, o:1, dev:sdb1
[89354.181443] RAID conf printout:
[89354.181444] --- level:6 rd:9 wd:6
[89354.181445] disk 0, o:1, dev:sdc1
[89354.181446] disk 1, o:1, dev:sdd1
[89354.181447] disk 2, o:1, dev:sdg1
[89354.181448] disk 3, o:1, dev:sdh1
[89354.181449] disk 4, o:1, dev:sdi1
[89354.181450] disk 8, o:1, dev:sdb1
[90001.391680] md0: detected capacity change from 28005493899264 to 0
[90001.391697] md: md0 stopped.
[90001.391717] md: unbind<sdf1>
[90001.396688] md: export_rdev(sdf1)
[90001.396808] md: unbind<sde1>
[90001.403661] md: export_rdev(sde1)
[90001.403726] md: unbind<sdc1>
[90001.412707] md: export_rdev(sdc1)
[90001.412867] md: unbind<sdb1>
[90001.415711] md: export_rdev(sdb1)
[90001.415782] md: unbind<sdj1>
[90001.421708] md: export_rdev(sdj1)
[90001.421783] md: unbind<sdi1>
[90001.424752] md: export_rdev(sdi1)
[90001.424909] md: unbind<sdh1>
[90001.427741] md: export_rdev(sdh1)
[90001.427807] md: unbind<sdg1>
[90001.433745] md: export_rdev(sdg1)
[90001.433812] md: unbind<sdd1>
[90001.436732] md: export_rdev(sdd1)

> You should provide "smartctl -i -A -l scterc /dev/sdX" reports for your
> drives. If you can find an old syslog for when your two worst drives
> fell out, it might help.

Here's the output for the disk with the read error for now, in case it's
useful.

# smartctl -i -A -l scterc /dev/sdj
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.16.7-29-desktop] (SUSE RPM)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST4000VN000-1H4168
Serial Number:    Z300NEB5
LU WWN Device Id: 5 000c50 063ed9f94
Firmware Version: SC43
User Capacity:    4 000 787 030 016 bytes [4,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Nov 9 18:58:47 2015 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 094   094   006    Pre-fail Always  -           28320486
  3 Spin_Up_Time            0x0003 092   092   000    Pre-fail Always  -           0
  4 Start_Stop_Count        0x0032 100   100   020    Old_age  Always  -           73
  5 Reallocated_Sector_Ct   0x0033 100   100   010    Pre-fail Always  -           160
  7 Seek_Error_Rate         0x000f 069   060   030    Pre-fail Always  -           17212021570
  9 Power_On_Hours          0x0032 079   079   000    Old_age  Always  -           19201
 10 Spin_Retry_Count        0x0013 100   100   097    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   020    Old_age  Always  -           73
184 End-to-End_Error        0x0032 100   100   099    Old_age  Always  -           0
187 Reported_Uncorrect      0x0032 055   055   000    Old_age  Always  -           45
188 Command_Timeout         0x0032 100   100   000    Old_age  Always  -           0
189 High_Fly_Writes         0x003a 001   001   000    Old_age  Always  -           169
190 Airflow_Temperature_Cel 0x0022 065   057   045    Old_age  Always  -           35 (Min/Max 30/37)
191 G-Sense_Error_Rate      0x0032 100   100   000    Old_age  Always  -           0
192 Power-Off_Retract_Count 0x0032 100   100   000    Old_age  Always  -           28
193 Load_Cycle_Count        0x0032 100   100   000    Old_age  Always  -           73
194 Temperature_Celsius     0x0022 035   043   000    Old_age  Always  -           35 (0 18 0 0 0)
197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           48
198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           48
199 UDMA_CRC_Error_Count    0x003e 200   200   000    Old_age  Always  -           0

SCT Error Recovery Control:
           Read:     70 (7,0 seconds)
          Write:     70 (7,0 seconds)

-- 
Guillaume Paumier
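
For anyone cross-checking the numbers in this message, here is a small sketch that recomputes two things from the output above: which array member lags behind in event count, and how the sector in the sense data relates to the sector md reports on sdj1. The helper names are hypothetical (they are not part of mdadm or the kernel), and the values are copied by hand from the listing and the dmesg log, not read from the devices.

```python
# Sketch only: all values are copied from the event counts and the dmesg
# output quoted in this message. Helper names are hypothetical.

EVENTS = {
    "/dev/sdb1": 198704, "/dev/sdc1": 198704, "/dev/sdd1": 198704,
    "/dev/sde1": 198704, "/dev/sdf1": 198704, "/dev/sdg1": 198704,
    "/dev/sdh1": 198704, "/dev/sdi1": 198704, "/dev/sdj1": 198700,
}

# (sense information bytes, "dev sdj, sector N", first "sector N on sdj1")
# for the first two read errors in the log.
ERRORS = [
    ("27 e8 fd f8", 669580792, 669578744),
    ("35 3b 6f c8", 893087688, 893085640),
]


def stale_members(events):
    """Return {device: how many events it is behind the newest member}."""
    newest = max(events.values())
    return {dev: newest - n for dev, n in events.items() if n < newest}


def sense_bytes_to_lba(hex_bytes):
    """Read space-separated hex bytes as one big-endian sector number."""
    return int.from_bytes(bytes.fromhex(hex_bytes), "big")


if __name__ == "__main__":
    print("lagging members:", stale_members(EVENTS))
    for sense, disk_sector, part_sector in ERRORS:
        lba = sense_bytes_to_lba(sense)
        print("sense LBA", lba,
              "matches end_request sector:", lba == disk_sector,
              "| whole-disk minus partition sector:", disk_sector - part_sector)
```

For both errors the whole-disk sector and the sdj1 sector differ by the same amount, which would be consistent with sdj1 starting at that offset on the disk. That is an inference from the log only; it has not been checked against the partition table.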