On Tue, Jun 28, 2016 at 11:33:36AM -0600, Chris Murphy wrote:
> On Tue, Jun 28, 2016 at 12:33 AM, Hannes Reinecke <hare@xxxxxxx> wrote:
> > Can you post a message log detailing this problem?
>
> Just over the weekend Phil Turmel posted an email with a bunch of back
> reading on the subject of timeout mismatches. I've lost track of how
> many user emails he's replied to, discovering this common
> misconfiguration, getting it straightened out, and more often than not
> helping the user recover data that otherwise would have been lost
> *because* of hard resets of the link instead of explicit read errors.

OK, but the two links you provided are not examples of these.

> http://www.spinics.net/lists/raid/msg50289.html

This one is basically a software or pilot-error problem that led to a
partition table being destroyed (with a dash of terrible advice along
the way, like "pull two disks out of the machine and see if the array
recovers"). The one SATA link reset in the logs took all of 9ms to
report a drive error about 4 seconds after boot. Nothing about this
would have been affected by changing the 30-second SATA timeout.

> http://www.spinics.net/lists/raid/msg52789.html

This one is a RAID5 array that was in degraded mode for a *year* before
it was finally taken out by a second disk failure. Data loss is the
expected outcome given those conditions--you don't get to keep your
data if you ignore drive failures for a year! Changing the timeout to
expose latent UREs could not have helped in that case: the errors were
already being detected, but the admin ignored their monitoring
responsibility and just left the array to die.

> He isn't the only list regular who helps educate users tirelessly with
> this very repetitive work around

He repeats it a lot, to be sure, and he's not wrong--but it doesn't
seem to be relevant in those specific examples. In both threads the
timeout mismatch mitigation is presented before any causal analysis of
the reported failure.

There's a use case for the long timeout in situations where the system
is no longer healthy and ddrescue/myrescue-style tools are in play. In
redundant setups that are still healthy, the time to error detection
should be as short as possible so repair can start sooner, while still
long enough to avoid crazy numbers of false positives. Unfortunately
that's not what seems to happen if the Linux-side timeout is shortened.
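Since the workaround keeps coming up in this thread without being
spelled out, here it is for reference. This is only a sketch--sdX is a
placeholder, and the numbers are the ones customarily suggested on this
list, not something I've benchmarked:

    # Does the drive support SCT ERC? (times print in deciseconds)
    smartctl -l scterc /dev/sdX

    # If yes: tell the drive itself to give up after 7 seconds,
    # comfortably inside the kernel's 30-second command timeout.
    smartctl -l scterc,70,70 /dev/sdX

    # If not (typical desktop drives): raise the kernel's per-device
    # timeout instead, above the drive's worst-case internal retries.
    echo 180 > /sys/block/sdX/device/timeout

Neither setting is persistent--SCT ERC generally resets on a power
cycle and the sysfs timeout on every reboot--so both typically end up
in a udev rule or a boot script.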
> for a very old misconfiguration that
> as far as I can tell only exists on Linux. And it's the default
> behavior.

[...]

> Usually recoveries don't take minutes. But they can take minutes. And
> that's where the problem comes in. I don't see why the user should be
> the one punished by the kernel, which is in effect what a 30 second
> default command timer is doing.

Long timeouts don't really serve anyone, even in the single-disk case.
I was once presented with a machine with an obvious disk failure:
application startups were painfully slow, taking multiple minutes, and
the disk was making loud clicking and rattling noises--but the drive
never reported a problem to the OS (Windows, as it happened). The
machine's owners would not believe that the disk had failed, given the
lack of reported errors, and would not authorize a test build with a
new disk restored from backups--an expensive proposition, since there
weren't any backups, and making a copy of the broken disk would have
taken days if it succeeded at all. The owners were convinced it was
some sort of software problem.

Finally I told the users to run a drive self-test over a weekend, and--
after about 40 hours, with only 4% of the disk tested--it finally found
a bad sector it couldn't read and generated an error code that would
get the drive replaced. Apparently the machine's users had been living
with this for three months before I got there, and the machine was
unusable the whole time. A much shorter error timeout would at least
have provided evidence of a hardware problem, even if the specific
error it surfaced was the wrong one.
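That machine ran Windows, but for anyone who wants to run the same
check on a Linux box, the smartctl flow is roughly the following
(again a sketch, with sdX as a placeholder):

    # Start the long (full-surface) self-test; it runs inside the
    # drive, so the machine stays usable while it grinds along.
    smartctl -t long /dev/sdX

    # Check progress/results later; a failed test reports the LBA of
    # the first unreadable sector.
    smartctl -l selftest /dev/sdX

Note that the overall health flag from "smartctl -H" can still read
PASSED on a drive in this state, which is exactly the kind of false
reassurance those owners were relying on.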