Re: RAID1 scrub ignoring read errors?

Brad Campbell <lists2009@xxxxxxxxxxxxxxx> · Tue, 4 Dec 2018 08:27:50 +0800

On 4/12/18 3:16 am, Phil Turmel wrote:
> On 12/3/18 12:49 PM, Niklas Hambüchen wrote:
>> On 2018-12-03 18:35, Phil Turmel wrote:
>>> Your drives appear to be fine.  I suspect you have a problem with other
>>> hardware in this box.
>>
>> When I repeat `smartctl -t short` on these disks, it fails at exactly the same sector:
>>
>> Disk 1:
>>    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
>>    # 1  Short offline       Completed: read failure       40%     16424         7501728
>>    # 2  Short offline       Completed: read failure       40%     16398         7501728
>>
>> Disk 2:
>>    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
>>    # 1  Short offline       Completed: read failure       50%     16424         1758544
>>    # 2  Short offline       Completed: read failure       50%     16398         1758544
>>
>> Doesn't this suggest that this is not unfortunate behaviour of a power supply, but permanent damage to the disks (even if originally caused by a power or power supply problem)?
>
> Those failures should increment your Current_Pending_Sector attribute in
> those drives.  But you say those remain zero.  So I'm stumped.

Yeah, that's weird. A SMART test will abort on the first error and it 
always bumps the Current_Pending_Sector counts (well, it has on all my 
drives anyway).

Try reading the whole disk with:
dd if=/dev/sdX of=/dev/null bs=1M conv=noerror

That will read every sector (or block of 8 on a 4K drive) and will keep 
barging through after read failures. At the end of that you'll likely 
have a spray of Current_Pending_Sector(s) which will give you an 
indication of just how bad things are.
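If you want to see the before/after numbers side by side, something like the following might do. It is only a sketch: smart_raw and scan_disk are names I made up, and it assumes your `smartctl -A` output uses the usual ten-column layout with RAW_VALUE in the last column and WHEN_FAILED shown as a single "-".

```shell
# Print the RAW_VALUE column for one SMART attribute.
# Reads `smartctl -A` output on stdin; $1 is the attribute name.
# Assumes the usual 10-column attribute table (RAW_VALUE is field 10).
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Read the whole device, carrying on past read errors, and report how
# Current_Pending_Sector changed. $1 is the device, e.g. /dev/sdX.
# (scan_disk is a hypothetical helper name, not a real tool.)
scan_disk() {
    disk=$1
    before=$(smartctl -A "$disk" | smart_raw Current_Pending_Sector)
    dd if="$disk" of=/dev/null bs=1M conv=noerror
    after=$(smartctl -A "$disk" | smart_raw Current_Pending_Sector)
    echo "Current_Pending_Sector: before=$before after=$after"
}
```

Run it as root with `scan_disk /dev/sdX`. It only reads from the disk, but a full pass on a large drive will take hours.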

I'm with Phil. It sounds like a power issue. If there is a power hiccup 
while the drive is writing, it'll occasionally write out a corrupt 
sector. That turns into a URE as the ECC doesn't match, so the drive 
never gets a clean read. That also means it'll never auto-reallocate on 
read.

When you attempt to write it, it'll write cleanly and therefore leave no 
trace of reallocation in the SMART data (because it hasn't reallocated 
anything).
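One way to check this on a single sector is to read just the LBA the self-test reported, before and after the sector gets rewritten. A rough sketch, assuming LBA_of_first_error counts logical sectors of 512 bytes (pass a different size as the third argument if your drive differs); read_lba is a made-up helper name:

```shell
# Read one logical sector by LBA and report whether the read succeeded.
# $1 = device, $2 = LBA, $3 = logical sector size in bytes (default 512,
# which is an assumption; check your drive).
read_lba() {
    dev=$1 lba=$2 ss=${3:-512}
    flags=""
    # On a block device, use O_DIRECT so the read goes to the drive
    # instead of being answered from the page cache.
    [ -b "$dev" ] && flags="iflag=direct"
    if dd if="$dev" of=/dev/null bs="$ss" skip="$lba" count=1 $flags 2>/dev/null; then
        echo "LBA $lba: readable"
    else
        echo "LBA $lba: read error"
        return 1
    fi
}
```

For Disk 1 above that would be `read_lba /dev/sdX 7501728`. It only reads, so it is safe to repeat; if the sector reads cleanly after a rewrite and the reallocation count hasn't moved, that fits the corrupt-write explanation.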

In the case of a genuine bad sector, the drive will try to write it, 
that write will fail, and it'll then write the data elsewhere 
(reallocate). If the drive is truly dying it'll run out of spare sectors 
to reallocate to and complain loudly. That's pretty rare however.

I have an old Toshiba laptop drive here (in a drawer) with a 
reallocation count in the high 4 digits, and it still has plenty to 
spare. The impact of that on system performance was massive as it was 
always seeking all over the platter to get the out-of-sequence 
reallocated sectors, but it was still working and reliable.

Regards,
Brad