Re: URE, link resets, user hostile defaults

Pasi Kärkkäinen <pasik@xxxxxx> · Fri, 19 Aug 2016 13:00:10 +0300



ping

Let's not forget this thread :)


-- Pasi

On Tue, Jul 05, 2016 at 12:43:04AM +0300, Pasi Kärkkäinen wrote:
> On Wed, Jun 29, 2016 at 08:17:51AM -0400, Zygo Blaxell wrote:
> > On Tue, Jun 28, 2016 at 11:33:36AM -0600, Chris Murphy wrote:
> > > On Tue, Jun 28, 2016 at 12:33 AM, Hannes Reinecke <hare@xxxxxxx> wrote:
> > > > Can you post a message log detailing this problem?
> > >
> > > Just over the weekend Phil Turmel posted an email with a bunch of back
> > > reading on the subject of timeout mismatches for someone to read. I've
> > > lost track of how many user emails he's replied to, discovering this
> > > common misconfiguration, getting it straightened out, and more often
> > > than not helping the user recover data that otherwise would have been
> > > lost *because* of hard link resetting instead of explicit read errors.
> > 
> > OK, but the two links you provided are not examples of these.
> > 
> 
> Here's one of the threads where Phil explains the issue:
> 
> http://marc.info/?l=linux-raid&m=133665797115876&w=2
> 
> quote:
> 
> 
> "A very common report I see on this mailing list is people who have lost
> arrays where the drives all appear to be healthy.  Given the large size
> of today's hard drives, even healthy drives will occasionally have an
> unrecoverable read error.
> 
> When this happens in a raid array with a desktop drive without SCTERC,
> the driver times out and reports an error to MD.  MD proceeds to
> reconstruct the missing data and tries to write it back to the bad
> sector.  However, that drive is still trying to read the bad sector and
> ignores the controller.  The write is immediately rejected.  BOOM!  The
> *write* error ejects that member from the array.  And you are now
> degraded.
> 
> If you don't notice the degraded array right away, you probably won't
> notice until a URE on another drive pops up.  Once that happens, you
> can't complete a resync to revive the array.
> 
> Running a "check" or "repair" on an array without TLER will have the
> opposite of the intended effect: any URE will kick a drive out instead
> of fixing it.
> 
> In the same scenario with an enterprise drive, or a drive with SCTERC
> turned on, the drive read times out before the controller driver, the
> controller never resets the link to the drive, and the followup write
> succeeds.  (The sector is either successfully corrected in place, or
> it is relocated by the drive.)  No BOOM."
> 
> 
> 
> -- Pasi
> 
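
To make the mismatch Phil describes concrete, here is a minimal sketch (not
taken from his mail) of the comparison involved: the drive's error recovery
limit versus the driver's command timeout. It rests on assumptions that are
mine rather than the thread's: that the driver timeout is the value in
/sys/block/<dev>/device/timeout, in seconds, and that the drive's ERC limit
is expressed in units of 100 ms, as smartctl's scterc log reports it. The
function names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def read_kernel_timeout(dev: str, sysfs_root: str = "/sys/block") -> int:
    """Return the driver's command timeout in seconds for a device like 'sda'.

    Assumption: the timeout is exposed at <sysfs_root>/<dev>/device/timeout.
    """
    return int(Path(sysfs_root, dev, "device", "timeout").read_text().strip())


def timeout_mismatch(erc_deciseconds: Optional[int], kernel_timeout_s: int) -> bool:
    """Return True when the drive may still be retrying as the driver gives up.

    erc_deciseconds is the drive's error recovery limit in units of 100 ms,
    or None when SCTERC is unsupported or disabled. With no limit there is
    no upper bound to compare against, so this sketch conservatively
    reports a mismatch in that case.
    """
    if erc_deciseconds is None:
        return True
    # The drive must give up strictly before the driver does; otherwise the
    # driver times out first and resets the link.
    return erc_deciseconds / 10 >= kernel_timeout_s
```

For example, timeout_mismatch(70, 30) is False (the drive reports the read
error after 7 seconds, well inside a 30 second driver timeout), while
timeout_mismatch(None, 30) is True, which is the desktop drive case in the
quote above. The ERC value itself would have to come from the drive, for
instance by reading the output of "smartctl -l scterc /dev/sdX"; parsing
that output is left out here.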
