A lot depends on what you get back from the disk. If it reports an
actual confirmed media error, that generally means the disk already
retried up to its set time limit. A failure to respond would be the
thing worth retrying, but I think a failure to respond is not
generally retried.

At one second, with an average seek time of 10ms, the disk has
attempted to reread that sector close to 100 times without success.
The timeout is generally 7 seconds if one does not change it, so 700
or so retries, and for that whole timeout the application stops
responding; 7 seconds is a long time. (See the smartctl note at the
end of this mail for checking/changing the drive-side limit.)

Now if one wanted to add retries, one could configure multipath to
operate over the disks (each with a single path) and set the retry
rules inside it (see the multipath.conf sketch at the end of this
mail). I accidentally tested that: a version of Fedora a few years
ago installed and enabled multipath, and when my raid machine booted
up, half of the raid6 disks were running through multipath and half
were not (a race condition). I did not notice it for several days
because it was working cleanly.

Error risks: I am assuming a disk fails 32k sections as a group
(mine seem to almost always re-write 8 sections of 8x512 bytes
each), i.e. that the disk is playing some sort of internal games; if
it really is only 4k, then all of the numbers below get much, much
better. Further assume you have 1000 bad blocks on a 1TB disk; at
32k per section that disk has about 31,250,000 sections. So the odds
of hitting a bad sector at the same spot on another disk when you
hit one error are (disk_count * 1000) / 31,250,000, or about 1 in
4000 for 8 disks (the Python sketch at the end of this mail works
these numbers).

Now while that seems high, I would say that having all 8 disks with
an average of 1000 bad sectors each means your disks really should
have been replaced a long time ago. When I have had disks acting
really badly, typically only one disk has a high error count, one
other has only a few, and the rest are clean. A more realistic case
would be 100 errors on only 2 disks, which reduces the odds to about
1 in 156,000.

And this is the raid5 case; the raid6 case, where one needs 3 bad
sectors at the same spot, is much safer. And if you do happen to hit
2 bad sectors at the same spot in the raid5 case, the array will
stop and you can then force it back online with all the disks.

Raid5 is rather risky if one disk completely fails, because then any
other bad sector on the remaining disks means lost data. I don't
know that I would run raid5 with spinning disks anymore; that is
probably the most likely way one runs into data loss.

On Wed, Nov 6, 2019 at 10:53 AM Wols Lists <antlists@xxxxxxxxxxxxxxx> wrote:
>
> On 05/11/19 22:46, Nigel Croxon wrote:
> >
> > On 11/4/19 7:33 PM, Wols Lists wrote:
> >> On 04/11/19 20:01, Nigel Croxon wrote:
> >>> The MD driver for level-456 should prevent re-reading read errors.
> >>>
> >>> For redundant raid it makes no sense to retry the operation:
> >>> When one of the disks in the array hits a read error, that will
> >>> cause a stall for the reading process:
> >>> - either the read succeeds (e.g. after 4 seconds the HDD error
> >>> strategy could read the sector)
> >>> - or it fails after HDD imposed timeout (w/TLER, e.g. after 7
> >>> seconds (might be even longer)
> >> Okay, I'm being completely naive here, but what is going on? Are you
> >> saying that if we hit a read error, we just carry on, ignore it, and
> >> calculate the missing block from parity?
> >>
> >> If so, what happens if we hit two errors on a raid-5, or 3 on a raid-6,
> >> or whatever ... :-)
> >>
> >> Cheers,
> >> Wol
> >
> > This allows the device (disk) to fail faster. All logic is the same.
> > If there is a read error, it does not retry that read, it calculates
> > the data from the other disks. This patch removes the retry.
> >
> Ummm ...
>
> I suspect there is a very good reason for the retry ...
>
> Bear in mind I don't actually KNOW anything, you'll need to check with
> someone who knows about these things, but I get the impression that
> transient errors aren't that uncommon. It fails, you try again, it
> succeeds.
>
> So if you're going to go down that route, by all means re-calculate from
> parity if ONE read fails, but if you get more failures such that the
> raid fails, you need to retry those reads because there is a good chance
> they will succeed second time round.
>
> Cheers,
> Wol
>
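P.S. A few sketches for the points above.

On the 7-second drive timeout: on drives that support SCT ERC
(a.k.a. TLER), smartmontools can show and change the drive's
internal error-recovery limit. These are real smartctl invocations,
but /dev/sda and the 7-second value are only examples:

  # show the current SCT ERC read/write limits (units of 100ms)
  smartctl -l scterc /dev/sda
  # set both limits to 7.0 seconds, if the drive allows it
  smartctl -l scterc,70,70 /dev/sda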
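On the multipath retry idea: a minimal /etc/multipath.conf sketch.
no_path_retry and polling_interval are real dm-multipath options,
but the values here are illustrative guesses, not tested
recommendations:

  defaults {
      # seconds between path health checks
      polling_interval 5
      # keep queueing/retrying I/O for 12 polling intervals (~60s)
      # before failing it upward; "fail" would fail immediately
      no_path_retry 12
  }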
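And the back-of-the-envelope odds, as a small Python sketch using
the same assumptions as above (32,000-byte sections, 1TB disks):

  # odds that a read error on one disk lines up with a bad section
  # at the same offset on one of the other disks
  def overlap_odds(disk_count, bad_sections_per_disk,
                   disk_bytes=10**12, section_bytes=32_000):
      sections = disk_bytes // section_bytes  # ~31,250,000 per 1TB disk
      return (disk_count * bad_sections_per_disk) / sections

  print(1 / overlap_odds(8, 1000))  # ~3906   -> "about 1 in 4000"
  print(1 / overlap_odds(2, 100))   # ~156250 -> "about 1 in 156,000"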