Re: SOLVED [was Re: GPT corruption on Primary Header, backup OK, fixing primary nuked array -- help?]

Andreas Dröscher <raid@xxxxxxxxxx> · Thu, 28 Jul 2016 22:51:55 +0200

On 28.07.16 at 14:53, Anthony Youngman wrote:
> On 28/07/16 00:10, David C. Rankin wrote:
>> On 07/27/2016 08:04 AM, Anthony Youngman wrote:
>>> WD Blacks? Do they support SCT/ERC? I think these are desktop drives (like my
>>> Barracudas) so you WILL get bitten by the timeout problem if anything goes
>>> wrong. Do you know what you're doing here?
>> Yes, WD Blacks, and yes, at least for the last 16 years I've managed, somehow,
>> to provide a complete open-source backend for my law office. So I would answer
>> the 2nd question in the affirmative as well. You can poo-poo drive X versus
>> drive Y all you want, but I get a consistent 5 years out of each WD black and
>> plan on a replacement cycle of 1/2 that. Go with what works for you.
>>
> I'll just say I don't think the past 16 years is a good guide at all ... (but I
> will add I'm doing exactly the same as you - two 3TB desktop drives in a mirror
> :-).
> 
> The timeout problem seems to be relatively recent. MOST 1TB or less drives don't
> seem to have an issue. It's bigger drives that will bite you.
> 

All drives are tailored to a use case: price, power consumption (e.g. WD Green),
desktop performance (WD Black) and RAID (WD Red or WD Enterprise Storage, which
also carries a black label). One of the key features of RAID drives is TLER
(Time-Limited Error Recovery). Note: the name may vary by brand.

My WD-ES drive shows:
smartctl -l scterc /dev/sda
smartctl 6.6 2016-05-07 r4319 [...] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

server ~ #

That means the drive will report a media error after 7 seconds and leave it
to the RAID controller / RAID subsystem to recover it. Linux RAID usually
reconstructs the bad block from the remaining drives and rewrites it, fixing
the issue (the drive's firmware relocates the sector).

Drives that are not RAID-optimized may spend a long time trying to recover such
a sector. Hence the RAID controller will not simply fix the sector but fail the
entire drive for not responding. For this reason, an array can fail that would
not have failed with proper TLER.

The issue can be mitigated by tuning SCT ERC (where the drive supports it) or by
raising the kernel command timeout in /sys/block/sda/device/timeout.
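
A rough sketch of that tuning, under a few assumptions: try to set a 7-second
SCT ERC limit with smartctl, and if the drive refuses, raise the kernel's
command timeout instead. The function name tune_drive, the SMARTCTL and
SYSFS_ROOT override variables, and the 180-second fallback are illustrative
choices of mine, not something from this thread; the right fallback depends on
how long your drives can stall. It also assumes smartctl exits non-zero when
the drive rejects the setting, so verify that on your hardware.

```shell
#!/bin/sh
# Sketch only. tune_drive, SMARTCTL and SYSFS_ROOT are hypothetical names
# introduced for this example; the 180 s fallback is an assumed value.

tune_drive() {
    dev="$1"                                  # bare device name, e.g. "sda"
    smartctl_bin="${SMARTCTL:-smartctl}"      # override point for testing
    sysfs_root="${SYSFS_ROOT:-/sys}"          # override point for testing

    # scterc takes read,write limits in units of 100 ms: 70 = 7.0 seconds.
    if "$smartctl_bin" -l scterc,70,70 "/dev/$dev" >/dev/null 2>&1; then
        echo "$dev: SCT ERC set to 7.0 seconds"
    else
        # Drive will not limit its own recovery time, so give the kernel
        # more patience before it gives up on the drive.
        echo 180 > "$sysfs_root/block/$dev/device/timeout"
        echo "$dev: no SCT ERC, kernel timeout raised to 180 seconds"
    fi
}
```

Run it as root, for example: for d in sda sdb; do tune_drive "$d"; done
Neither setting survives a reboot, and SCT ERC is typically lost on a power
cycle as well, so this belongs in a boot script.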

- Andreas
