ping

Let's not forget this thread :)

-- Pasi

On Tue, Jul 05, 2016 at 12:43:04AM +0300, Pasi Kärkkäinen wrote:
> On Wed, Jun 29, 2016 at 08:17:51AM -0400, Zygo Blaxell wrote:
> > On Tue, Jun 28, 2016 at 11:33:36AM -0600, Chris Murphy wrote:
> > > On Tue, Jun 28, 2016 at 12:33 AM, Hannes Reinecke <hare@xxxxxxx> wrote:
> > > > Can you post a message log detailing this problem?
> > >
> > > Just over the weekend Phil Turmel posted an email with a bunch of back
> > > reading on the subject of timeout mismatches for someone to read. I've
> > > lost track of how many user emails he's replied to, discovering this
> > > common misconfiguration, getting it straightened out, and more often
> > > than not helping the user recover data that otherwise would have been
> > > lost *because* of hard link resetting instead of explicit read errors.
> >
> > OK, but the two links you provided are not examples of these.
>
> Here's one of the threads where Phil explains the issue:
> http://marc.info/?l=linux-raid&m=133665797115876&w=2
>
> quote:
>
> "A very common report I see on this mailing list is people who have lost arrays
> where the drives all appear to be healthy.
> Given the large size of today's hard drives, even healthy drives will occasionally
> have an unrecoverable read error.
>
> When this happens in a raid array with a desktop drive without SCTERC,
> the driver times out and reports an error to MD. MD proceeds to
> reconstruct the missing data and tries to write it back to the bad
> sector. However, that drive is still trying to read the bad sector and
> ignores the controller. The write is immediately rejected. BOOM! The
> *write* error ejects that member from the array. And you are now
> degraded.
>
> If you don't notice the degraded array right away, you probably won't
> notice until a URE on another drive pops up. Once that happens, you
> can't complete a resync to revive the array.
>
> Running a "check" or "repair" on an array without TLER will have the
> opposite of the intended effect: any URE will kick a drive out instead
> of fixing it.
>
> In the same scenario with an enterprise drive, or a drive with SCTERC
> turned on, the drive read times out before the controller driver, the
> controller never resets the link to the drive, and the followup write
> succeeds. (The sector is either successfully corrected in place, or
> it is relocated by the drive.) No BOOM."
>
> -- Pasi

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
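The failure mode Phil describes comes down to one comparison: a URE is repaired only if the drive gives up on the bad sector (its SCT ERC read limit) before the kernel's per-command timeout fires and resets the link. The toy model below is a sketch of that comparison, not a diagnostic tool. Assumptions: the ERC limit is expressed in units of 100 ms as with `smartctl -l scterc` (e.g. 70 = 7.0 s, 0 = disabled), the kernel timeout is the value in `/sys/block/<dev>/device/timeout` (30 s by default); the 120 s worst case for a desktop drive with no ERC limit and the suggested 180 s timeout are illustrative assumptions, and all function and class names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Default Linux SCSI disk command timeout (seconds), as seen in /sys/block/<dev>/device/timeout.
DEFAULT_KERNEL_TIMEOUT_S = 30
# ASSUMPTION: worst-case internal retry time of a desktop drive with no ERC limit.
ASSUMED_DESKTOP_RECOVERY_S = 120
# ASSUMPTION: commonly suggested raised kernel timeout when ERC cannot be enabled.
SUGGESTED_LONG_TIMEOUT_S = 180


@dataclass
class DriveConfig:  # hypothetical model of one MD member
    name: str
    # SCT ERC read limit in units of 100 ms (70 -> 7.0 s); None or 0 means no limit.
    scterc_read_ds: Optional[int]
    kernel_timeout_s: int = DEFAULT_KERNEL_TIMEOUT_S


def has_erc_limit(cfg: DriveConfig) -> bool:
    """True if the drive will give up on a bad sector after a bounded time."""
    return bool(cfg.scterc_read_ds)


def drive_recovery_s(cfg: DriveConfig) -> float:
    """Upper bound on how long the drive retries a bad sector before reporting."""
    if not has_erc_limit(cfg):
        return float(ASSUMED_DESKTOP_RECOVERY_S)
    return cfg.scterc_read_ds / 10.0


def ure_outcome(cfg: DriveConfig) -> str:
    """Model what happens to the member when it hits an unrecoverable read error."""
    if drive_recovery_s(cfg) < cfg.kernel_timeout_s:
        # Drive reports the read error first; MD rewrites the sector and the write succeeds.
        return "repaired"
    # Kernel timer fires first: link reset while the drive is still busy, MD's rewrite fails.
    return "ejected"


def advice(cfg: DriveConfig) -> str:
    """One-line recommendation for this member under the model above."""
    if ure_outcome(cfg) == "repaired":
        return f"{cfg.name}: OK"
    if not has_erc_limit(cfg):
        return (f"{cfg.name}: no ERC limit; enable SCT ERC if supported, "
                f"or raise kernel timeout to {SUGGESTED_LONG_TIMEOUT_S}s")
    return (f"{cfg.name}: ERC limit {drive_recovery_s(cfg)}s exceeds "
            f"{cfg.kernel_timeout_s}s kernel timeout")


if __name__ == "__main__":
    drives = [
        DriveConfig("sda", scterc_read_ds=70),                        # ERC 7.0 s < 30 s
        DriveConfig("sdb", scterc_read_ds=None),                      # desktop drive, default timeout
        DriveConfig("sdc", scterc_read_ds=None, kernel_timeout_s=180),
    ]
    for d in drives:
        print(ure_outcome(d), "|", advice(d))
```

Under these assumptions `sda` and `sdc` come out "repaired" and `sdb` comes out "ejected", matching the enterprise-drive and desktop-drive scenarios in the quote: either the drive's limit must be shortened or the kernel's timeout lengthened so the drive always answers first.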