Summary for anyone who missed this thread:

If a RAID5 array is scanned to verify that the parity matches the data, how should the system handle a mismatch? Assume no disk reported a read error. If the parity does not agree, then one or more disks hold wrong data. The parity disk could be the wrong one, in which case no data is lost or corrupt, yet; but a disk failure at this point would corrupt data unless the failed disk holds the parity for that stripe. If any other disk is wrong, then data is already corrupt. If you are lucky, the corrupt data falls in unused space, in which case no real data is corrupt, yet.

I agree with your assessment, or you agree with mine! :)  I disagree on how it should be handled.

Now, what to do if a parity error occurs? As I see it, we have these possible options:

A. Ignore it. This is what is done today. But log the blocks affected. The risk of data corruption is high, and the risk of additional corruption increases when a disk fails. Rebuilding to the spare has the side effect of "correcting" the parity, so the error is then masked.

B. Just correct the parity. You stand a high risk of data corruption without knowing about it, but without correcting the parity you run the same risk anyway. Log the blocks affected. With the parity made consistent, no additional corruption will occur when a disk fails.

C. Mark all blocks (or chunks) affected by the parity error as unreadable. This would cause data loss, but no corruption. The data lost would be the size of the mismatch (in blocks or chunks) times the number of disks in the array minus one. This behaves more like a disk drive when a sector can't be read. Log the blocks affected. In a 14-disk RAID5 array, a single-sector parity error would cause 13 sectors to be lost, or much more if going by chunks. Optionally, still allow option B at some later time at the user's request, so the user can determine what data is affected and then attempt to recover some of it.

D. 
Report the error, and allow manual parity correction. This is option A, followed by option B at the user's request.

E. All of the above. Have the option to choose which of the above behaviors you want, so each user can decide how the system handles parity errors. This option should be configured per array, not system-wide.

Guy

-----Original Message-----
From: linux-raid-owner@xxxxxxxxxxxxxxx [mailto:linux-raid-owner@xxxxxxxxxxxxxxx] On Behalf Of Dieter Stueken
Sent: Monday, November 22, 2004 3:22 AM
To: linux-raid@xxxxxxxxxxxxxxx
Subject: Re: Bad blocks are killing us!

Guy Watkins wrote:
> "... but the md-level approach might be better. But I'm not sure I see
> the point of it---unless you have raid 6 with multiple parity blocks,
> if a disk actually has the wrong information recorded on it I don't
> think you can detect which drive is bad, just that one of them is."
>
> If there is a parity block that does not match the data, true, you do not
> know which device has the wrong data. However, if you do not "correct" the
> parity, then when a device fails it will be reconstructed differently than
> it was before it failed. This will just cause more corrupt data. The parity
> must be made consistent with whatever data is on the data blocks to prevent
> this corrosion of data. With RAID6 it should be possible to determine which
> block is wrong. It would be a pain in the @$$, but I think it would be
> doable. I will explain my theory if someone asks.

This is exactly the same conflict a single drive has with an unreadable sector. It notes the sector as bad and cannot fulfill any read request until the data is rewritten or erased. The single drive cannot (and should never try to!) silently replace the bad sector with a spare sector, as it cannot recover the content. Likewise, the RAID system cannot solve this problem automagically, and never should, as the former content can no longer be deduced.
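To make the ambiguity concrete, here is a minimal sketch (plain Python, not md's actual code) of the check a scan performs: XOR all blocks of a stripe together and test whether the result is zero. A non-zero result proves the stripe is inconsistent, but carries no information about which block is wrong:

```python
from functools import reduce

def parity_mismatch(stripe):
    """Return True if the XOR of all blocks in a RAID5 stripe
    (data blocks plus parity block) is non-zero, i.e. the parity
    does not match the data."""
    x = reduce(lambda a, b: bytes(p ^ q for p, q in zip(a, b)), stripe)
    return any(x)

# Three data blocks and their parity: a consistent stripe.
d = [b'\x0f\x0f', b'\xf0\xf0', b'\xaa\xaa']
p = bytes(a ^ b ^ c for a, b, c in zip(*d))
assert not parity_mismatch(d + [p])

# Flip one bit anywhere -- data or parity -- and the check fires,
# but the XOR alone cannot tell us *which* block was corrupted.
bad = [d[0], bytes([d[1][0] ^ 0x01]) + d[1][1:], d[2], p]
assert parity_mismatch(bad)
```

With RAID6's second, differently weighted syndrome, the faulty block can in principle be located, which is the RAID6 possibility Guy alludes to above.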
But notice that we have two very different problems to examine. The above problem arises if all disks of the RAID system claim to read correct data, whereas the parity information tells us that one of them must be wrong. As long as we don't have RAID6 to locate single-block errors, the data is LOST and cannot be recovered.

This is very different from the situation where one of the disks DOES report an internal CRC error. In that case your data CAN be recovered reliably from the parity information, and in most cases successfully written back to the disk. But there is also a difference between the problem for RAID and for the disk internally: whereas the disk always reads the full CRC data for a sector to verify its integrity, the RAID system does not normally check the validity of the parity information at all (this is why the idea of data scans came up in the first place).

So if a scan discovers bad parity information, the only action that can (and must!) be taken is to tag this piece of data as invalid. And it is very important not merely to log that information somewhere; it is even more important to prevent further reads of this piece of lost data. Otherwise the definitely invalid data may be read again without any notice, and may even get written back and thus turn into "valid" data, even though it has become garbage.

People often argue for some spare-sector management, which would solve all problems. I think this is an illusion. Spare sectors can only be useful if you fail WRITING data, not when reading failed or data was lost. This is already handled sufficiently within the single disks (I think). If your disk gives write errors, you either have a very old one without internal spare-sector management, or your disk has already run out of spare sectors. Read errors are much more frequent than write errors and thus a much more important issue.

Dieter Stüken.
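The "tag it and refuse further reads" behavior described above might be modeled like this (a toy sketch with hypothetical names, not the md driver's interface; the kernel's actual per-device bad-block list came much later):

```python
class ScrubbedArray:
    """Toy model of options C/D: a scan records inconsistent stripes
    in a bad-stripe list, and reads of those stripes fail loudly
    instead of silently returning possibly-corrupt data."""

    def __init__(self):
        self.bad_stripes = set()   # stripe numbers found inconsistent

    def record_mismatch(self, stripe_no):
        # Called by the scan when parity does not match the data.
        self.bad_stripes.add(stripe_no)

    def read(self, stripe_no, read_from_disks):
        if stripe_no in self.bad_stripes:
            # Behave like a drive with an unreadable sector:
            # refuse, rather than hand back garbage unnoticed.
            raise IOError(f"stripe {stripe_no} marked invalid by scrub")
        return read_from_disks(stripe_no)

    def user_override(self, stripe_no):
        # Option B at the user's request: accept the data as-is and
        # clear the flag (the parity would be rewritten at this point).
        self.bad_stripes.discard(stripe_no)

arr = ScrubbedArray()
arr.record_mismatch(7)
try:
    arr.read(7, lambda n: b"...")
except IOError:
    pass  # the error is reported instead of masked
arr.user_override(7)
assert arr.read(7, lambda n: b"ok") == b"ok"
```

The key property is that the invalid data can never be read back, and so can never be rewritten and laundered into "valid" data.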
--
Dieter Stüken, con terra GmbH, Münster
stueken@xxxxxxxxxxx http://www.conterra.de/
(0)251-7474-501
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html