Re: do i need to give up on this setup

On 06.10.20 at 09:56, Daniel Sanabria wrote:
> Yeah it is quite possible that the hardware can't support the setup
> because this array was an afterthought and considered an upgrade to
> the system.
> 
> For the record here are some more details about the setup:
> 
> Motherboard: ASROCK EP2C602-4L/D16
> PSU: 850W Corsair RM Series
> I have 6 drives connected to the motherboard. The 3 drives forming the
> array are Western Digital Green WDC WD30EZRX-00D8PB0 and are connected
> to 3 ports of the Marvell 9230 SATA controller. The other 3 drives are
> Western Digital Caviar Blue (SATA) WDC WD5000AAKS-00A7B2 and are connected
> to the spare Marvell port and the 2 SATA/SAS Motherboard ports
> available.

Yeah, those are unreliable desktop disks, and they should not sit in an array
without at least increasing the timeouts. Currently the only WD disks for a
RAID setup are the "WD Gold", after they lost their brain and started using
SMR on "WD Red".

https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

-------------------------
[root@srv-rhsoft:~]$ cat /etc/systemd/system/disk-timeout.service
[Unit]
Description=SCSI Timeouts

[Service]
Type=oneshot
ExecStart=/usr/local/bin/disk-timeout.sh

[Install]
WantedBy=multi-user.target

-------------------------

[root@srv-rhsoft:~]$ cat /usr/local/bin/disk-timeout.sh
#!/usr/bin/dash

echo 180 > "/sys/block/sda/device/timeout"
echo 180 > "/sys/block/sdb/device/timeout"
echo 180 > "/sys/block/sdc/device/timeout"
echo 180 > "/sys/block/sdd/device/timeout"

-------------------------
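
A more general variant, as a rough sketch only: instead of hard-coding sda to
sdd, loop over every sd* entry. The function name set_disk_timeouts and its
two arguments are made up for this example and are not part of the script
above; the sysfs root is a parameter only so the function can be tried against
a scratch directory first. Note that the smartctl output quoted below lists no
"SCT Error Recovery Control supported" line under "SCT capabilities", which
suggests these Greens cannot have their error recovery time capped, so raising
the kernel-side timeout is probably the option left.

```shell
#!/bin/sh
# Sketch: raise the SCSI command timeout for every sd* disk under a sysfs root.
# set_disk_timeouts and its arguments are hypothetical names for illustration.

set_disk_timeouts() {
    sysfs_root="${1:-/sys/block}"   # directory that contains sda, sdb, ...
    timeout_secs="${2:-180}"        # same value the original script writes

    for timeout_file in "$sysfs_root"/sd*/device/timeout; do
        # If no disk matches, the glob stays unexpanded; skip that case.
        [ -f "$timeout_file" ] || continue
        echo "$timeout_secs" > "$timeout_file"
    done
}
```

In disk-timeout.sh the function would be followed by a single call such as
"set_disk_timeouts /sys/block 180", run as root by the same oneshot unit.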

> On Mon, 5 Oct 2020 at 16:58, Roger Heflin <rogerheflin@xxxxxxxxx> wrote:
>>
>> What they said: you have a hardware problem.
>>
>> it could be about anything previously mentioned and could also be the
>> power supply being unable to provide a stable 12V for the disks.
>>
>> You should give the list more specifics on your hw setup. Of
>> interest are what kind of SATA/SAS ports you are using and how the
>> disks are cabled in.
>>
>> Note that there are a number of controllers that aren't the most
>> reliable and some of those controllers when something happens will
>> stop responding for all disks connected to it.
>>
>> I have also seen badly designed motherboards with
>> built-in (non-AMD/non-Intel chip) SATA ports that don't work under any
>> load that uses more than a single disk at a time, and/or act badly
>> when given SMART commands.
>>
>> On Mon, Oct 5, 2020 at 9:30 AM Daniel Sanabria <sanabria.d@xxxxxxxxx> wrote:
>>>
>>>> I meant not to me personally, but to the mailing list. The drives seem OK
>>>> though, even sde.
>>>
>>> Sorry missed the reply-all button
>>>
>>> On Mon, 5 Oct 2020 at 15:04, Roman Mamedov <rm@xxxxxxxxxxx> wrote:
>>>>
>>>> On Mon, 5 Oct 2020 14:59:35 +0100
>>>> Daniel Sanabria <sanabria.d@xxxxxxxxx> wrote:
>>>>
>>>>>> It looks like a drive is dropping off the bus and then failing to reidentify,
>>>>>> could be bad cabling/controller/PSU, or just a bad drive. You should post
>>>>>> "smartctl -a" of all drives as well.
>>>>
>>>> I meant not to me personally, but to the mailing list. The drives seem OK
>>>> though, even sde.
>>>>
>>>>> [dan@lamachine ~]$ sudo smartctl -a /dev/sdc
>>>>> [sudo] password for dan:
>>>>> smartctl 6.6 2017-11-05 r4594
>>>>> [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
>>>>> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
>>>>>
>>>>> === START OF INFORMATION SECTION ===
>>>>> Model Family:     Western Digital Green
>>>>> Device Model:     WDC WD30EZRX-00D8PB0
>>>>> Serial Number:    WD-WCC4NCWT13RF
>>>>> LU WWN Device Id: 5 0014ee 25fc9e460
>>>>> Firmware Version: 80.00A80
>>>>> User Capacity:    3,000,591,900,160 bytes [3.00 TB]
>>>>> Sector Sizes:     512 bytes logical, 4096 bytes physical
>>>>> Rotation Rate:    5400 rpm
>>>>> Device is:        In smartctl database [for details use: -P show]
>>>>> ATA Version is:   ACS-2 (minor revision not indicated)
>>>>> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
>>>>> Local Time is:    Mon Oct  5 14:58:34 2020 BST
>>>>> SMART support is: Available - device has SMART capability.
>>>>> SMART support is: Enabled
>>>>>
>>>>> === START OF READ SMART DATA SECTION ===
>>>>> SMART overall-health self-assessment test result: PASSED
>>>>>
>>>>> General SMART Values:
>>>>> Offline data collection status:  (0x82) Offline data collection activity
>>>>> was completed without error.
>>>>> Auto Offline Data Collection: Enabled.
>>>>> Self-test execution status:      (   0) The previous self-test routine completed
>>>>> without error or no self-test has ever
>>>>> been run.
>>>>> Total time to complete Offline
>>>>> data collection: (38940) seconds.
>>>>> Offline data collection
>>>>> capabilities: (0x7b) SMART execute Offline immediate.
>>>>> Auto Offline data collection on/off support.
>>>>> Suspend Offline collection upon new
>>>>> command.
>>>>> Offline surface scan supported.
>>>>> Self-test supported.
>>>>> Conveyance Self-test supported.
>>>>> Selective Self-test supported.
>>>>> SMART capabilities:            (0x0003) Saves SMART data before entering
>>>>> power-saving mode.
>>>>> Supports SMART auto save timer.
>>>>> Error logging capability:        (0x01) Error logging supported.
>>>>> General Purpose Logging supported.
>>>>> Short self-test routine
>>>>> recommended polling time: (   2) minutes.
>>>>> Extended self-test routine
>>>>> recommended polling time: ( 391) minutes.
>>>>> Conveyance self-test routine
>>>>> recommended polling time: (   5) minutes.
>>>>> SCT capabilities:        (0x7035) SCT Status supported.
>>>>> SCT Feature Control supported.
>>>>> SCT Data Table supported.
>>>>>
>>>>> SMART Attributes Data Structure revision number: 16
>>>>> Vendor Specific SMART Attributes with Thresholds:
>>>>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>>>>   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
>>>>>   3 Spin_Up_Time            0x0027   178   165   021    Pre-fail  Always       -       6075
>>>>>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       81
>>>>>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>>>>>   7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
>>>>>   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18577
>>>>>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>>>>>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>>>>>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
>>>>> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       46
>>>>> 193 Load_Cycle_Count        0x0032   142   142   000    Old_age   Always       -       176661
>>>>> 194 Temperature_Celsius     0x0022   122   109   000    Old_age   Always       -       28
>>>>> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
>>>>> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
>>>>> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
>>>>> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
>>>>> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
>>>>>
>>>>> SMART Error Log Version: 1
>>>>> No Errors Logged
>>>>>
>>>>> SMART Self-test log structure revision number 1
>>>>> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
>>>>> # 1  Extended offline    Completed without error       00%     17479         -
>>>>> # 2  Short offline       Completed without error       00%     15531         -
>>>>>
>>>>> SMART Selective self-test log data structure revision number 1
>>>>>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>>>>>     1        0        0  Not_testing
>>>>>     2        0        0  Not_testing
>>>>>     3        0        0  Not_testing
>>>>>     4        0        0  Not_testing
>>>>>     5        0        0  Not_testing
>>>>> Selective self-test flags (0x0):
>>>>>   After scanning selected spans, do NOT read-scan remainder of disk.
>>>>> If Selective self-test is pending on power-up, resume after 0 minute delay.
>>>>>
>>>>> [dan@lamachine ~]$ sudo smartctl -a /dev/sdd
>>>>> smartctl 6.6 2017-11-05 r4594
>>>>> [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
>>>>> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
>>>>>
>>>>> === START OF INFORMATION SECTION ===
>>>>> Model Family:     Western Digital Green
>>>>> Device Model:     WDC WD30EZRX-00D8PB0
>>>>> Serial Number:    WD-WCC4NPRDD6D7
>>>>> LU WWN Device Id: 5 0014ee 25fca27b1
>>>>> Firmware Version: 80.00A80
>>>>> User Capacity:    3,000,592,982,016 bytes [3.00 TB]
>>>>> Sector Sizes:     512 bytes logical, 4096 bytes physical
>>>>> Rotation Rate:    5400 rpm
>>>>> Device is:        In smartctl database [for details use: -P show]
>>>>> ATA Version is:   ACS-2 (minor revision not indicated)
>>>>> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
>>>>> Local Time is:    Mon Oct  5 14:58:54 2020 BST
>>>>> SMART support is: Available - device has SMART capability.
>>>>> SMART support is: Enabled
>>>>>
>>>>> === START OF READ SMART DATA SECTION ===
>>>>> SMART overall-health self-assessment test result: PASSED
>>>>>
>>>>> General SMART Values:
>>>>> Offline data collection status:  (0x82) Offline data collection activity
>>>>> was completed without error.
>>>>> Auto Offline Data Collection: Enabled.
>>>>> Self-test execution status:      (   0) The previous self-test routine completed
>>>>> without error or no self-test has ever
>>>>> been run.
>>>>> Total time to complete Offline
>>>>> data collection: (39060) seconds.
>>>>> Offline data collection
>>>>> capabilities: (0x7b) SMART execute Offline immediate.
>>>>> Auto Offline data collection on/off support.
>>>>> Suspend Offline collection upon new
>>>>> command.
>>>>> Offline surface scan supported.
>>>>> Self-test supported.
>>>>> Conveyance Self-test supported.
>>>>> Selective Self-test supported.
>>>>> SMART capabilities:            (0x0003) Saves SMART data before entering
>>>>> power-saving mode.
>>>>> Supports SMART auto save timer.
>>>>> Error logging capability:        (0x01) Error logging supported.
>>>>> General Purpose Logging supported.
>>>>> Short self-test routine
>>>>> recommended polling time: (   2) minutes.
>>>>> Extended self-test routine
>>>>> recommended polling time: ( 392) minutes.
>>>>> Conveyance self-test routine
>>>>> recommended polling time: (   5) minutes.
>>>>> SCT capabilities:        (0x7035) SCT Status supported.
>>>>> SCT Feature Control supported.
>>>>> SCT Data Table supported.
>>>>>
>>>>> SMART Attributes Data Structure revision number: 16
>>>>> Vendor Specific SMART Attributes with Thresholds:
>>>>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>>>>   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
>>>>>   3 Spin_Up_Time            0x0027   178   164   021    Pre-fail  Always       -       6100
>>>>>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       81
>>>>>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>>>>>   7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
>>>>>   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18580
>>>>>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>>>>>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>>>>>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
>>>>> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       53
>>>>> 193 Load_Cycle_Count        0x0032   136   136   000    Old_age   Always       -       192427
>>>>> 194 Temperature_Celsius     0x0022   121   108   000    Old_age   Always       -       29
>>>>> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
>>>>> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
>>>>> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
>>>>> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
>>>>> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
>>>>>
>>>>> SMART Error Log Version: 1
>>>>> No Errors Logged
>>>>>
>>>>> SMART Self-test log structure revision number 1
>>>>> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
>>>>> # 1  Extended offline    Completed without error       00%     17481         -
>>>>> # 2  Short offline       Completed without error       00%     15534         -
>>>>>
>>>>> SMART Selective self-test log data structure revision number 1
>>>>>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>>>>>     1        0        0  Not_testing
>>>>>     2        0        0  Not_testing
>>>>>     3        0        0  Not_testing
>>>>>     4        0        0  Not_testing
>>>>>     5        0        0  Not_testing
>>>>> Selective self-test flags (0x0):
>>>>>   After scanning selected spans, do NOT read-scan remainder of disk.
>>>>> If Selective self-test is pending on power-up, resume after 0 minute delay.
>>>>>
>>>>> [dan@lamachine ~]$ sudo smartctl -a /dev/sde
>>>>> smartctl 6.6 2017-11-05 r4594
>>>>> [x86_64-linux-4.18.0-193.14.2.el8_2.x86_64] (local build)
>>>>> Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
>>>>>
>>>>> === START OF INFORMATION SECTION ===
>>>>> Model Family:     Western Digital Green
>>>>> Device Model:     WDC WD30EZRX-00D8PB0
>>>>> Serial Number:    WD-WCC4N1294906
>>>>> LU WWN Device Id: 5 0014ee 25f968120
>>>>> Firmware Version: 80.00A80
>>>>> User Capacity:    3,000,591,900,160 bytes [3.00 TB]
>>>>> Sector Sizes:     512 bytes logical, 4096 bytes physical
>>>>> Rotation Rate:    5400 rpm
>>>>> Device is:        In smartctl database [for details use: -P show]
>>>>> ATA Version is:   ACS-2 (minor revision not indicated)
>>>>> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
>>>>> Local Time is:    Mon Oct  5 14:58:57 2020 BST
>>>>> SMART support is: Available - device has SMART capability.
>>>>> SMART support is: Enabled
>>>>>
>>>>> === START OF READ SMART DATA SECTION ===
>>>>> SMART overall-health self-assessment test result: PASSED
>>>>>
>>>>> General SMART Values:
>>>>> Offline data collection status:  (0x82) Offline data collection activity
>>>>> was completed without error.
>>>>> Auto Offline Data Collection: Enabled.
>>>>> Self-test execution status:      (   0) The previous self-test routine completed
>>>>> without error or no self-test has ever
>>>>> been run.
>>>>> Total time to complete Offline
>>>>> data collection: (43200) seconds.
>>>>> Offline data collection
>>>>> capabilities: (0x7b) SMART execute Offline immediate.
>>>>> Auto Offline data collection on/off support.
>>>>> Suspend Offline collection upon new
>>>>> command.
>>>>> Offline surface scan supported.
>>>>> Self-test supported.
>>>>> Conveyance Self-test supported.
>>>>> Selective Self-test supported.
>>>>> SMART capabilities:            (0x0003) Saves SMART data before entering
>>>>> power-saving mode.
>>>>> Supports SMART auto save timer.
>>>>> Error logging capability:        (0x01) Error logging supported.
>>>>> General Purpose Logging supported.
>>>>> Short self-test routine
>>>>> recommended polling time: (   2) minutes.
>>>>> Extended self-test routine
>>>>> recommended polling time: ( 433) minutes.
>>>>> Conveyance self-test routine
>>>>> recommended polling time: (   5) minutes.
>>>>> SCT capabilities:        (0x7035) SCT Status supported.
>>>>> SCT Feature Control supported.
>>>>> SCT Data Table supported.
>>>>>
>>>>> SMART Attributes Data Structure revision number: 16
>>>>> Vendor Specific SMART Attributes with Thresholds:
>>>>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>>>>   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
>>>>>   3 Spin_Up_Time            0x0027   176   166   021    Pre-fail  Always       -       6158
>>>>>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       80
>>>>>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>>>>>   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
>>>>>   9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18465
>>>>>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>>>>>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>>>>>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       80
>>>>> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       53
>>>>> 193 Load_Cycle_Count        0x0032   142   142   000    Old_age   Always       -       174015
>>>>> 194 Temperature_Celsius     0x0022   121   107   000    Old_age   Always       -       29
>>>>> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
>>>>> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
>>>>> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
>>>>> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
>>>>> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
>>>>>
>>>>> SMART Error Log Version: 1
>>>>> No Errors Logged
>>>>>
>>>>> SMART Self-test log structure revision number 1
>>>>> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
>>>>> # 1  Extended offline    Completed without error       00%     17347         -
>>>>> # 2  Short offline       Completed without error       00%     15414         -
>>>>>
>>>>> SMART Selective self-test log data structure revision number 1
>>>>>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>>>>>     1        0        0  Not_testing
>>>>>     2        0        0  Not_testing
>>>>>     3        0        0  Not_testing
>>>>>     4        0        0  Not_testing
>>>>>     5        0        0  Not_testing
>>>>> Selective self-test flags (0x0):
>>>>>   After scanning selected spans, do NOT read-scan remainder of disk.
>>>>> If Selective self-test is pending on power-up, resume after 0 minute delay.
>>>>>
>>>>> [dan@lamachine ~]$
>>>>>
>>>>>
>>>>> On Mon, 5 Oct 2020 at 14:44, Roman Mamedov <rm@xxxxxxxxxxx> wrote:
>>>>>>
>>>>>> On Mon, 5 Oct 2020 14:10:25 +0100
>>>>>> Daniel Sanabria <sanabria.d@xxxxxxxxx> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Scrubbing ( # echo check >
>>>>>>> /sys/devices/virtual/block/md1/md/sync_action) is killing my array :(
>>>>>>>
>>>>>>> I'm attaching details of the array and disks (bloody wd greens) as
>>>>>>> well as journalctl errors providing some details about the issue.
>>>>>>>
>>>>>>> If you have any pointers on what might be the cause of this as well as
>>>>>>> any recommendations on how to improve things please let me know. Thank you
>>>>>>> in advance ...
>>>>>>>
>>>>>>> I have backups of the data so happy to move this to a different setup
>>>>>>> you might recommend (apps will be mostly reading from the array via
>>>>>>> NFS since most of the content will be media).
>>>>>>>
>>>>>>> My suspicion is that a timer service is kicking in and disrupting the
>>>>>>> scrubbing somehow but can't pinpoint what causes this.
>>>>>>
>>>>>> It looks like a drive is dropping off the bus and then failing to reidentify,
>>>>>> could be bad cabling/controller/PSU, or just a bad drive. You should post
>>>>>> "smartctl -a" of all drives as well.



