Re: Recent drive errors

On Tue 19 May 2015 09:23:20 AM Phil Turmel wrote:
> On 05/19/2015 08:50 AM, Thomas Fjellstrom wrote:
> > On Tue 19 May 2015 08:34:55 AM Phil Turmel wrote:
> >> Based on the smart report, this drive is perfectly healthy.  A small
> >> number of uncorrectable read errors is normal in the life of any drive.
> > 
> > Is it perfectly normal for the same sector to be reported uncorrectable 5
> > times in a row like it did?
> 
> Yes, if you keep trying to read it.  Unreadable sectors stay unreadable,
> generally, until they are re-written.  That's the first opportunity the
> drive has to decide if a relocation is necessary.
> 
> > How many UREs are considered "ok"? Tens, hundreds, thousands, tens of
> > thousands?
> 
> Depends.  In a properly functioning array that gets scrubbed
> occasionally, or sufficiently heavy use to read the entire contents
> occasionally, the UREs get rewritten by MD right away.  Any UREs then
> only show up once.

This time around I've made sure that it's doing regular scrubs and regular
SMART self-tests.
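
For reference, the scheduling is nothing fancy, just /etc/crontab entries
along these lines (md0 and sdX stand in for the real devices here):

    # weekly md scrub: read everything so md can rewrite any UREs it finds
    0 3 * * 0  root  echo check > /sys/block/md0/md/sync_action
    # afterwards: cat /sys/block/md0/md/mismatch_cnt
    # monthly long SMART self-test on each member drive
    0 4 1 * *  root  smartctl -t long /dev/sdX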

> In a desktop environment, or non-raid, or improperly configured raid,
> the UREs will build up, and get reported on every read attempt.
> 
> Most consumer-grade drives claim a URE average below 1 per 1E14 bits
> read.  So by the end of their warranty period, getting one every 12TB
> read wouldn't be unusual.  This sort of thing follows a Poisson
> distribution:
> 
> http://marc.info/?l=linux-raid&m=135863964624202&w=2
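
Running the numbers on that spec, it does come out to roughly what you say,
if I'm doing the arithmetic right:

    1E14 bits / 8 = 1.25E13 bytes, i.e. about 12.5 TB read per URE on average
    one full read of a 3 TB drive  ~ 3 / 12.5 = 0.24 expected UREs
    so four or five scrub passes   ~ one expected hit
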
> 
> > These drives have been barely used. Most of their life, they were either
> > off, or not actually being used. (it took a while to collect enough 3TB
> > drives, and then find time to build the array, and set it up as a regular
> > backup of my 11TB nas).
> 
> While being off may lengthen their life somewhat, the magnetic domains
> on these things are so small that some degradation will happen just
> sitting there.  Diffusion in the p- and n-doped regions of the
> semiconductors is also happening while sitting unused, degrading the
> electronics.
> 
> >>  It has no relocations, and no pending sectors.  The latency spikes are
> >> 
> >> likely due to slow degradation of some sectors that the drive is having
> >> to internally retry to read successfully.  Again, normal.
> > 
> > The latency spikes are /very/ regular and there's quite a lot of them.
> > See: http://i.imgur.com/QjTl6o3.png
> 
> Interesting.  I suspect that if you wipe that disk with noise, read it
> all back, and wipe it again, you'll have a handful of relocations.

It looks like each of the blocks in that display is 128 KiB, which I think
means those red blocks aren't very far apart, maybe 80 MiB or so. Would it
reallocate all of those? That'd be a lot of reallocated sectors.
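
If I do try the wipe-and-reread experiment, I assume this is roughly what
you mean; destructive, obviously, so only once the drive is out of the
array (sdX is a placeholder):

    # fill the drive with noise, then read the whole thing back
    dd if=/dev/urandom of=/dev/sdX bs=1M oflag=direct status=progress
    dd if=/dev/sdX of=/dev/null bs=1M iflag=direct status=progress
    # or let badblocks do destructive write/read/verify passes in one go
    badblocks -wsv /dev/sdX
    # then see whether anything actually got remapped
    smartctl -A /dev/sdX | grep -iE 'realloc|pending|uncorrect'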

> Your latency test will show different numbers then, as the head will
> have to seek to the spare sector and back whenever you read through one
> of those spots.
> 
> Or the rewrites will fix them all, and you'll have no further problems.
>  Hard to tell.  Bottom line is that drives can't fix any problems they
> have unless they are *written* in previously identified problem areas.
> 
> >> I own some "DM001" drives -- they are unsuited to raid duty as they
> >> don't support ERC.  So, out of the box, they are time bombs for any
> >> array you put them in.  That's almost certainly why they were ejected
> >> from your array.
> >> 
> >> If you absolutely must use them, you *must* set the *driver* timeout to
> >> 120 seconds or more.
> > 
> > I've been planning on looking into the ERC stuff. I now actually have some
> > drives that do support ERC, so it'll be interesting to make sure
> > everything is set up properly.
> 
> You have it backwards.  If you have WD Reds, they are correct out of the
> box.  It's when you *don't* have ERC support, or you only have desktop
> ERC, that you need to take special action.

I was under the impression that you still had to enable ERC at every boot.
And I /thought/ I read that you still want to adjust the timeouts, just not
to the same values as for consumer drives.
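
In case it's useful to anyone else, my understanding of the per-drive
checks boils down to something like this (sdX is a placeholder, and 70
deciseconds = 7 seconds is the value I keep seeing suggested here):

    # see whether the drive supports SCT ERC and what it's currently set to
    smartctl -l scterc /dev/sdX
    # enable 7 second read/write error recovery where it is supported
    smartctl -l scterc,70,70 /dev/sdX
    # for drives with no usable ERC, raise the kernel command timeout
    # well past the drive's internal retry time instead
    echo 180 > /sys/block/sdX/device/timeout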

> If you have consumer grade drives in a raid array, and you don't have
> boot scripts or udev rules to deal with timeout mismatch, your *ss is
> hanging in the wind.  The links in my last msg should help you out.

There was some talk of ERC/TLER and md a while back. I'll still have to
find or write a script that sets the timeouts properly and enables TLER on
drives that support it but don't ship with it enabled, something like the
sketch below.
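
This is roughly what I have in mind as a boot script; untested sketch, and
the "does it support ERC" test may need to be smarter than a grep:

    #!/bin/bash
    # for every sd* disk: try to enable 7s ERC, otherwise fall back to a
    # long driver timeout so md doesn't eject the drive on a slow sector
    for dev in /sys/block/sd*; do
        disk=/dev/${dev##*/}
        if smartctl -l scterc,70,70 $disk | grep -q seconds; then
            echo "$disk: ERC set to 7 seconds"
        else
            echo 180 > $dev/device/timeout
            echo "$disk: no ERC, driver timeout raised to 180 seconds"
        fi
    done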

> Also, I noticed that you used "smartctl -a" to post a complete report of
> your drive's status.  It's not complete.  You should get in the habit of
> using "smartctl -x" instead, so you see the ERC status, too.

Good to know. Thanks.
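
For my own notes, the extended report is just:

    # extended report; unlike -a this includes the SCT ERC settings
    smartctl -x /dev/sdX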

> Phil

-- 
Thomas Fjellstrom
thomas@xxxxxxxxxxxxx