Re: Raid scrub generated errors, should I worry?

"Wilson, Jonathan" <piercing_male@xxxxxxxxxxx> · Mon, 2 Mar 2015 19:45:36 +0000

On Mon, 2015-03-02 at 11:32 -0700, Chris Murphy wrote:
> [253981.512570] sd 5:0:0:0: [sdf]
> [253970.368375] Add. Sense: Unrecovered read error - auto reallocate failed
> [253970.368380] end_request: I/O error, dev sdf, sector 5680577752
> 
> I'm confused. The above happens twice. So it seems clear the problem is
> with /dev/sdf and sector 5680577752. Since it's an AF drive, technically
> sectors 5680577752 - 5680577759 are affected, since those are the LBAs for
> a single physical sector.
> 
> However, all of the "read error corrected" that follow have completely
> different values, 5478837464 through 5478837536.
> 
> And then 3 seconds later another read error at the same LBA:
> 
> [253977.980604] sd 5:0:0:0: [sdf]
> [253977.980605] Add. Sense: Unrecovered read error - auto reallocate failed
> [253977.980612] end_request: I/O error, dev sdf, sector 5680579352
> 
> and 4 seconds later
> 
> [253981.512576] sd 5:0:0:0: [sdf]
> [253981.512577] Add. Sense: Unrecovered read error - auto reallocate failed
> [253981.512582] end_request: I/O error, dev sdf, sector 5680579352
> 
> 
> And then "read error corrected" 5478839064 through 5478839136 which are
> different than the first batch.
> 
> So there's a single LBA reported by libata as URE multiple times, each with
> identical address. But then two corrected events, each with a different
> range of sectors, neither of which match the URE address.
> 
> ??

I have no idea about the differing sector locations, way beyond my
knowledge... however one thought did occur to me.

The drives are WD Reds with TLER enabled. When the drive realised that
an error had occurred, instead of retrying the read a few times and then
possibly relocating or rewriting the sector, or whatever else a drive
may try, would its first imperative be to "chuck the error out and let
the OS/raid card deal with it"? That might be why no pending sectors,
relocations or other errors showed in smartctl, apart from the RRER
rising to 4 before I ran the smartctl extended self-test. After the test
no values changed, except for the addition of

> # 1  Extended offline    Completed without error       00%     10949         -

The messages about "read error corrected" were generated by mdadm (I'm
assuming, given the text), and as you say the initial errors were
generated by libata (which I assume is the disk subsystem?). So perhaps
the two have different ideas about sectors (logical v physical?), or
mdadm reports sectors within the raid device (the raid data location
within the partition that holds the raid member?).
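
For what it's worth, here is a rough check of the logical v physical
idea. It assumes 512-byte logical sectors on 4096-byte physical sectors
(8 logical per physical, as Chris describes for an AF drive); the
function name is just something I made up for illustration.

```python
# Assumption: 512e Advanced Format drive, 8 logical sectors per physical one.
LOGICAL_PER_PHYSICAL = 4096 // 512


def physical_sector_span(lba):
    """Return (first_lba, last_lba) of the physical sector containing lba."""
    first = (lba // LOGICAL_PER_PHYSICAL) * LOGICAL_PER_PHYSICAL
    return first, first + LOGICAL_PER_PHYSICAL - 1


# Sector reported by the kernel for sdf in the log above.
span = physical_sector_span(5680577752)
print(span)
```

If that is right, the reported LBA is simply the first of the 8 logical
sectors in one physical sector, and a factor of 8 in either direction
does not turn the disk number into the mdadm one, so logical v physical
on its own probably doesn't explain the difference.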

The numbers are a long way apart, 5680577752 (disk) v
5478837464-5478837536 (mdadm), so perhaps the mdadm figure is the sector
number within the raid member, i.e. relative to partition 5 on sdf
rather than to the whole disk?
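
A quick way to test that guess is to subtract the numbers. The sketch
below uses only the sector values quoted above; pairing each disk error
with the start of the "corrected" batch that followed it is my
assumption, as is the 512-byte sector size used for the size estimate.

```python
# Sector numbers copied from the log lines quoted above.
disk_errors = [5680577752, 5680579352]          # end_request I/O errors on sdf
md_corrected_starts = [5478837464, 5478839064]  # first sector of each "corrected" batch

# Difference between the whole-disk sector and the mdadm-reported sector.
offsets = [d - m for d, m in zip(disk_errors, md_corrected_starts)]
same_offset = len(set(offsets)) == 1

# Size of the offset, assuming 512-byte sectors.
offset_gib = offsets[0] * 512 / 2**30

print(offsets, same_offset, round(offset_gib, 1))
```

Both pairs differ by the same 201740288 sectors (roughly 96 GiB), which
would fit the mdadm figures being relative to somewhere inside the disk
rather than to sector 0 of sdf. Whether that offset really is the start
of partition 5 plus the raid data offset I can't confirm from here; the
partition table and mdadm --examine output would settle it.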

> 
> 
> Chris Murphy
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
