Re: RAID1 scrub ignoring read errors?

Niklas Hambüchen <mail@xxxxxx> · Sun, 2 Dec 2018 23:00:21 +0100

Hey Phil,

thanks for your swift reply.

> Disks don't need replacing on occasional read errors, because they are
> normal.  Typical consumer-grade hard drives quote an unrecoverable read
> error rate of under 1x10^-14.  That works out to, on average, one URE
> every 12.5 TB read.  On large drives and large arrays of drives, that's
> just a few reads from end to end.

This makes sense.

But does it apply here, given the flood of read errors in my dmesg in just a single scrub?
The probability of that many errors in a single pass over 3 TB seems very low.
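
As a rough sanity check of that intuition, here is a minimal sketch that models UREs as a Poisson process at the quoted rate of one error per 10^14 bits. The independence assumption and the function names are mine, not anything from the drive specification, and the 3e12 bytes stands for one full pass over a 3 TB disk:

```python
import math

# Assumption: the quoted figure means at most one URE per 10^14 bits read,
# and errors are independent of each other (a Poisson model).
URE_PER_BIT = 1e-14


def expected_ures(bytes_read, rate_per_bit=URE_PER_BIT):
    """Expected number of UREs when reading `bytes_read` bytes."""
    return bytes_read * 8 * rate_per_bit


def prob_at_least(k, mean, terms=40):
    """P(X >= k) for X ~ Poisson(mean), summing the upper tail directly.

    Truncating after `terms` terms is only reasonable for small means,
    which is the case here.
    """
    return sum(
        math.exp(-mean) * mean**i / math.factorial(i)
        for i in range(k, k + terms)
    )


mean = expected_ures(3e12)  # one pass over a 3 TB disk
print("expected UREs per pass:", mean)
print("P(at least 1 URE): ", prob_at_least(1, mean))
print("P(at least 10 UREs):", prob_at_least(10, mean))
```

Under these assumptions a single URE in one pass is unremarkable, while ten or more would be extremely unlikely, so a flood of errors does not fit the specified rate.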

I also read with interest your mentions of the timeout problem as well as:

  https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
  http://strugglers.net/~andy/blog/2015/11/09/linux-software-raid-and-drive-timeouts/

Could the timeout problem cause the flood of read errors?
I am not sure how to decide that from the dmesg output.

On the timeout topic, the disks in question are WD Red 3TB, and I get:

  $ smartctl -l scterc /dev/sdb
  SCT Error Recovery Control:
             Read:     70 (7.0 seconds)
            Write:     70 (7.0 seconds)
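
To relate these values to the timeout mismatch described on the wiki, here is a small sketch that parses the output above and compares it with the kernel's per-device command timeout. It assumes the timeout is read from /sys/block/<dev>/device/timeout and defaults to 30 seconds; the helper names are made up for illustration:

```python
import re

# Sample taken from the smartctl output above.
SAMPLE = """SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)
"""


def parse_scterc(text):
    """Return ERC limits in seconds, e.g. {'read': 7.0, 'write': 7.0}.

    smartctl reports the values in tenths of a second; a limit that is
    reported as "Disabled" is returned as None.
    """
    result = {}
    pattern = r"^\s*(Read|Write):\s+(\d+|Disabled)"
    for kind, value in re.findall(pattern, text, re.MULTILINE):
        result[kind.lower()] = None if value == "Disabled" else int(value) / 10.0
    return result


def read_kernel_timeout(dev):
    """Read the kernel command timeout in seconds for e.g. dev='sdb'."""
    with open(f"/sys/block/{dev}/device/timeout") as f:
        return float(f.read().strip())


def erc_within_kernel_timeout(erc, kernel_timeout_s=30.0):
    """True if every ERC limit is enabled and below the kernel timeout."""
    return bool(erc) and all(
        v is not None and v < kernel_timeout_s for v in erc.values()
    )


erc = parse_scterc(SAMPLE)
print(erc, erc_within_kernel_timeout(erc))
```

If the kernel timeout on these disks is still at the assumed default, 7.0 seconds is well below it, which would suggest that the timeout mismatch is not what is happening here.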

Another data point possibly relevant:

Even after I wait many minutes longer than the problematic 2-minute timeout threshold mentioned, a short self-test with `smartctl -t short` immediately turns up read errors on both disks:

Num  Test_Description  Status                  Remaining LifeTime(hours) LBA_of_first_error
Disk 1:
# 1  Short offline     Completed: read failure       40%    16398        7501728
Disk 2:
# 1  Short offline     Completed: read failure       50%    16398        1758544

I interpret this as the disks having real problems, rather than occasional UREs at the specified error rate.
What do you think?

Thanks!
Niklas