On Mon, 2015-03-02 at 11:32 -0700, Chris Murphy wrote:
> [253981.512570] sd 5:0:0:0: [sdf]
> [253970.368375] Add. Sense: Unrecovered read error - auto reallocate failed
> [253970.368380] end_request: I/O error, dev sdf, sector 5680577752
>
> I'm confused. The above happens twice. So it seems clear the problem is
> with /dev/sdf and sector 5680577752. Since it's an AF drive, technically
> sectors 5680577752 - 5680577760 are affected, since those are the LBA's for
> a single physical sector.
>
> However, all of the "read error corrected" that follow have completely
> different values, 5478837464 through 547883753.
>
> And then 3 seconds later another read error at the same LBA:
>
> [253977.980604] sd 5:0:0:0: [sdf]
> [253977.980605] Add. Sense: Unrecovered read error - auto reallocate failed
> [253977.980612] end_request: I/O error, dev sdf, sector 5680579352
>
> and 4 seconds later
>
> [253981.512576] sd 5:0:0:0: [sdf]
> [253981.512577] Add. Sense: Unrecovered read error - auto reallocate failed
> [253981.512582] end_request: I/O error, dev sdf, sector 5680579352
>
>
> And then "read error corrected" 5478839064 through 5478839136 which are
> different than the first batch.
>
> So there's a single LBA reported by libata as URE multiple times, each with
> identical address. But then two corrected events, each with a different
> range of sectors, neither of which match the URE address.
>
> ??

I have no idea about the differing sector locations; that is way beyond
my knowledge. However, one thought did occur to me. The drives are WD
Reds with TLER enabled. So when the drive realised that an error had
occurred, instead of performing a few read retries and then possibly a
reallocation or a rewrite (or whatever else a drive may try), would its
first imperative be to "chuck the error out and let the OS/RAID card
deal with it"? That might be why no pending sectors, reallocations or
other errors showed in smartctl, except for the increase in the RRER to
4 prior to running a smartctl scan.
After the smartctl scan no values changed, except for the addition of:

  # 1  Extended offline  Completed without error  00%  10949  -

The messages about "read error corrected" were generated by mdadm (I'm
assuming, given the text), and as you say the initial errors were
generated by libata (which I assume is the disk subsystem?). So perhaps
the two have a different idea about sectors (logical vs. physical?), or
mdadm counts sectors within the RAID device (the RAID data location
within the logical partition within the RAID member device?).

The numbers are well apart, 5680577752 (disk) vs. 5478837464-5478837536
(mdadm), so perhaps the mdadm figure is the sector within the RAID
member, within partition 5, within the disk sdf?

>
>
> Chris Murphy
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
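
One quick way to test the "different starting point" idea is to subtract
the two sets of numbers. The sketch below is only arithmetic on the
sector values quoted in this thread. It assumes (and this is my
assumption, not something the logs state) that each libata error
corresponds to the first "read error corrected" sector that followed it,
that the logs count 512-byte logical sectors, and that a physical sector
is 8 aligned logical sectors.

```python
# Sector numbers copied from the kernel log lines quoted in this thread.
# Each pair: (sector reported by libata for sdf,
#             first "read error corrected" sector that followed).
# Pairing them up like this is an assumption made for illustration.
pairs = [
    (5680577752, 5478837464),
    (5680579352, 5478839064),
]

SECTOR_BYTES = 512         # assumption: logs count 512-byte logical sectors
SECTORS_PER_PHYSICAL = 8   # assumption: 4096-byte physical sectors (AF drive)

# Difference between the whole-disk sector and the md-reported sector.
offsets = [disk - md for disk, md in pairs]
for (disk, md), off in zip(pairs, offsets):
    mib = off * SECTOR_BYTES / 2**20
    print(f"disk {disk} - md {md} = {off} sectors ({mib:.0f} MiB)")

# If md simply counts from a different starting point, the difference
# should be the same for every pair.
same_offset = len(set(offsets)) == 1
print("constant offset:", same_offset)

# Logical sectors that share a physical sector with the first failing LBA,
# assuming physical sectors start on multiples of 8 logical sectors.
first = pairs[0][0] - pairs[0][0] % SECTORS_PER_PHYSICAL
physical_range = (first, first + SECTORS_PER_PHYSICAL - 1)
print("physical sector spans LBAs", physical_range)
```

Both differences come out identical (201740288 sectors), which would fit
the mdadm numbers being counted from some fixed point further into the
disk, such as the start of the partition plus the RAID data offset,
rather than from the start of sdf. It does not prove that, but it at
least suggests the two sets of numbers refer to the same place.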