On Mon, Sep 23, 2024 at 11:19 PM Stephane Bakhos <nuitari-vger@xxxxxxxxxxx> wrote:
>
> Check the following attributes:
>
> - Reallocated_Sector_Ct
> - Seek_Error_Rate
> - Current_Pending_Sector
> - Offline_Uncorrectable
> - Raw_Read_Error_Rate
>
> If any of these are increasing, it's a sign of a dead drive. If
> the drive is taking a long time to time out from a surface defect it could
> be the cause.
>
> - UDMA_CRC_Error_Count
>
> That one is usually a sign of a bad cable.
>

SMART attributes look fine for all 10 drives. All values are above threshold:

bill@bill-desk:~$ for i in {e..n}; do echo -e "\nsd$i\n" ; sudo smartctl -x /dev/sd$i | grep 'ID\#\|Reallocated_\|Seek_Error\|Current_Pending\|Offline_Un\|Raw_Read\|UDMA'] ; done

sde

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   073   064   044    -    21117259
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   082   060   045    -    170317463
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0

sdf

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   079   064   044    -    71711229
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   082   060   045    -    171651865
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0

sdg

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   067   064   044    -    4728851
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   082   060   045    -    173788681
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0

sdh

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   080   064   044    -    96543092
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   082   060   045    -    175410482
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0

sdi

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   082   064   044    -    143574742
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   075   060   045    -    30335787
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0

sdj

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   073   064   044    -    17809062
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   064   060   045    -    2577528
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0

sdk

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   072   064   044    -    17754175
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   064   060   045    -    2765316
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0

sdl

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   073   064   044    -    17827863
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   064   060   045    -    2761597
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0

sdm

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   080   064   044    -    104938805
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   076   060   045    -    41478011
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0

sdn

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   081   064   044    -    116043252
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   076   060   045    -    41655990
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0

> >> Are the components of md1 (the unrelated array) on a different hardware
> >> controller / wires?
> >
> > Same controller, but I see the same results even if I unplug md1.
>
> Have you tried replacing the cabling and the backplane?
>
> I'm not sure what layout you have, but something I'd try is to remove the
> drives for md1 and put the md127 drives in the place where the md1 were.
> It would help rule out that either is the issue.

I haven't tried replacing the cabling, but when I swap the cables around
between the two arrays, everything behaves the same way.
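
For anyone repeating the attribute check above on many drives, here is a rough sketch of a filter that prints only the rows that look suspicious, so ten healthy tables reduce to no output. It is an illustration under assumptions, not a smartmontools feature: check_smart_attrs is a made-up helper name, it expects the eight-column attribute layout that "smartctl -x" prints (ID#, ATTRIBUTE_NAME, FLAGS, VALUE, WORST, THRESH, FAIL, RAW_VALUE), and the choice of which raw counters to flag is mine. Raw values for Raw_Read_Error_Rate and Seek_Error_Rate are vendor-specific, so for those it only compares the normalized VALUE against THRESH.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads smartctl attribute rows on stdin and prints
# only those that deserve a closer look. Prints nothing for healthy input.
check_smart_attrs() {
    awk '
        # Attribute rows start with a numeric ID; the header row does not.
        $1 ~ /^[0-9]+$/ && NF >= 8 {
            name = $2; value = $4 + 0; thresh = $6 + 0; raw = $8 + 0

            # Normalized VALUE at or below a nonzero THRESH: the drive
            # itself considers this attribute failed.
            if (thresh > 0 && value <= thresh)
                print name ": VALUE " $4 " at or below THRESH " $6

            # Assumption: for these counters any nonzero raw count is
            # worth reporting (bad sectors, or CRC errors on the link).
            else if (name ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ && raw > 0)
                print name ": raw count " $8
        }
    '
}
```

With the function defined in the current shell, one possible use is "sudo smartctl -x /dev/sde | check_smart_attrs" inside the same for loop as before. Since a single snapshot cannot show whether a counter is increasing, it is worth saving the output and comparing it against a later run.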