On 09/05/17 22:18, Nix wrote:
> On 9 May 2017, Chris Murphy verbalised:
>
>> On Tue, May 9, 2017 at 5:58 AM, David Brown <david.brown@xxxxxxxxxxxx> wrote:
>>
>>> I thought you said that you had read Neil's article. Please go back and
>>> read it again. If you don't agree with what is written there, then
>>> there is little more I can say to convince you.
>
> The entire article is predicated on the assumption that when an
> inconsistent stripe is found, fixing it is simple because you can just
> fail whichever device is inconsistent... but given that the whole
> premise of the article is that *you cannot tell which that is*, I don't
> see the point in failing anything.

The point is that if an inconsistent stripe is found, there is no way to be
sure how to fix it correctly. So a scrub ("check") certainly will not touch
it. And what should "repair" do? I see several choices:

1. It could assume the data is correct, and re-create the parities. This is
simple, and it avoids changing anything on the array from the viewpoint of
higher levels (i.e., the filesystem).

2. It could do a "smart" repair of the stripe, if it sees that there is only
one inconsistent block in the stripe.

3. It could pass the problem on to higher level tools (possibly correcting a
single inconsistency in the P or Q parities first).

At the moment, raid6 "repair" follows the first choice. Many people seem to
think the second choice is a good idea. Personally, I would say choice 3 is
right - but unless and until higher level tools exist, I think 1 is no worse
than 2, and it is simpler, clearer, and works today.

Key to why I don't like choice 2 is the question of why you have a mismatch
in the first place. Undetected read errors - the drive returning wrong data
as though it were correct - are astoundingly rare; even on huge disks they
hardly ever occur. (Unrecoverable read errors - the drive reporting a sector
as unreadable - are not uncommon. That is what raid is for.) If you do get a
mismatch, the likely causes are a crash or power failure during a stripe
write, or hardware faults such as bad memory. Neither of those gives you any
reason to believe that the one block singled out by the parity arithmetic is
the block that actually holds wrong data, so a "smart" repair can easily make
the situation worse.

Secondly, "smart" repair means changing the data on the disk. You can't do
that while a filesystem is mounted (unless you want to risk chaos), and one
major reason for using raid is to minimise downtime in the event of problems
- offline repair goes against that philosophy.

What do I mean by passing the problem on to higher levels? One example would
be another raid level sitting above, such as a raid1 mirror of two raid6
arrays (the other way round would make more sense, but the same principle
applies). The raid6 level could ask the block layer above it whether that
layer can re-create the correct data. A raid1 pair above could do exactly
that - the stripe would then be re-written with the full, known-correct data
rather than a guess. Perhaps the layer above is a filesystem - it could say
whether that stripe is actually in use (no need to worry if it only covers
deleted space), or whether it can re-create the data from a BTRFS duplicate.
Failing that, a tool could interact with the filesystem to determine what
sort of data was on that stripe, and perhaps check it in some way. At the
very least, a tool could run a consistency check - would the filesystem be
consistent if the stripe was "smart repaired", or would it be consistent if
the stripe data was left untouched (and the P & Q parities recreated)?
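To make the arithmetic behind these choices concrete, here is a minimal
sketch in Python (rather than the md driver's C) of how P and Q are formed
over a stripe, and of what choice 1 amounts to. The function names and the
byte-strip representation are purely illustrative, not md's internals:

def gf_mul(a, b):
    # Multiply two bytes in GF(2^8) using the RAID6 polynomial 0x11d.
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
    return r

def g_pow(n):
    # g^n for the RAID6 generator g = {02}.
    x = 1
    for _ in range(n):
        x = gf_mul(x, 2)
    return x

def compute_pq(data):
    # data is a list of equal-length byte strips, one per data disk.
    # P is the plain XOR of the strips; Q weights strip d by g^d.
    p = bytearray(len(data[0]))
    q = bytearray(len(data[0]))
    for d, strip in enumerate(data):
        coeff = g_pow(d)
        for i, byte in enumerate(strip):
            p[i] ^= byte
            q[i] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)

def repair_choice_1(data):
    # Choice 1 above: trust the data strips and simply regenerate P and Q.
    # Nothing visible to the filesystem changes.
    return compute_pq(data)

Choice 2 is what needs the extra machinery: comparing the stored P and Q
against the computed ones, and trying to work out which single strip (if
any) could explain the difference.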
A simple method here could be to mark the whole stripe as unreadable, then
run a filesystem check. If there are higher level raids that can re-create
the lost stripe, that will happen automatically. If not, the filesystem
repair will ensure that the filesystem is consistent, even though data may
be lost. And of course, a higher level repair tool could be one that simply
runs a "smart repair" on the stripe.

All in all, when there is /no/ correct answer, I think we have to be very
careful about picking methods here. Before switching to a "smart" repair
rather than the simple method, we have to be /very/ sure that it gives
noticeably "better" results in real-world cases. We can't just say it sounds
good - we need to know.

> The first comment in the article is someone noting that md doesn't say
> which device is failing, what the location of the error is or anything
> else a sysadmin might actually find useful for fixing it. "Hey, you have
> an error somewhere on some disk on this multi-terabyte array which might
> be data corruption and if a disk fails will be data corruption!" is not
> too useful :(

I haven't looked at the information you get out of the scrub, but of course
more information is better than less.

> The fourth comment notes that the "smart" approach, given RAID-6, has a
> significantly higher chance of actually fixing the problem than the
> simple approach. I'd call that a fairly important comment...
>
> (Neil said: "Similarly a RAID6 with inconsistent P and Q could well not
> be able to identify a single block which is "wrong" and even if it could
> there is a small possibility that the identified block isn't wrong, but
> the other blocks are all inconsistent in such a way as to accidentally
> point to it. The probability of this is rather small, but it is
> non-zero".

It is true that for some causes of mismatches, the "smart" repair has a high
chance of being correct.

> As far as I can tell the probability of this is exactly the
> same as that of multiple read errors in a single stripe -- possibly far
> lower, if you need not only multiple wrong P and Q values but *precisely
> mis-chosen* ones. If that wasn't acceptably rare, you wouldn't be using
> RAID-6 to begin with.
>
> I've been talking all the time about a stripe which is singly
> inconsistent: either all the data blocks are fine and one of P or Q is
> fine, or both P and Q and all but one data block is fine, and the
> remaining block is inconsistent with all the rest. Obviously if more
> blocks are corrupt, you can do nothing but report it. The redundancy
> simply isn't there to attempt repair.)

Or possibly mark the whole stripe as "unreadable", and punt the problem to
the higher levels.

>> H. Peter Anvin's RAID 6 paper, section 4 is what's apparently under discussion
>> http://milbret.anydns.info/pub/linux/kernel/people/hpa/raid6.pdf
>>
>> This is totally non-trivial, especially because it says raid6 cannot
>> detect or correct more than one corruption, and ensuring that
>> additional corruption isn't introduced in the rare case is even more
>> non-trivial.
>
> Yeah. Testing this is the bastard problem, really. Fault injection via
> dm is the only approach that seems remotely practical to me.

That's what the "FAULTY" raid level in md is for :-)  But what are the
/realistic/ fault situations?

>> I do think it's sane for raid6 repair to avoid the current assumption
>> that data strip is correct, by doing the evaluation in equation 27. If
>> there's no corruption do nothing, if there's corruption of P or Q then
>> replace, if there's corruption of data, then report but do not repair
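For concreteness, the evaluation being proposed looks roughly like the
sketch below - illustrative only, in Python rather than the md driver's C,
and reusing gf_mul() and compute_pq() from the earlier sketch. It classifies
a stripe from the P and Q syndromes in the spirit of section 4 of the paper:
fix P or Q if only one of them disagrees, and merely report a single suspect
data strip z (where g^z equals the ratio of the Q and P syndromes) rather
than rewriting it:

# Discrete log table for g = {02}, built with gf_mul() from the sketch above.
GF_LOG = [0] * 256
x = 1
for i in range(255):
    GF_LOG[x] = i
    x = gf_mul(x, 2)

def evaluate_stripe(data, p_stored, q_stored):
    # Returns 'consistent', 'fix P', 'fix Q', ('report data strip', z),
    # or 'ambiguous' when no single strip explains the mismatch.
    p_calc, q_calc = compute_pq(data)
    p_syn = bytes(a ^ b for a, b in zip(p_stored, p_calc))
    q_syn = bytes(a ^ b for a, b in zip(q_stored, q_calc))
    if not any(p_syn) and not any(q_syn):
        return 'consistent'
    if any(p_syn) and not any(q_syn):
        return 'fix P'                 # only P disagrees: rewrite P from data
    if not any(p_syn) and any(q_syn):
        return 'fix Q'                 # only Q disagrees: rewrite Q from data
    # Both disagree: a single bad data strip D_z would give, at every byte
    # position, q_syn / p_syn == g^z for one and the same z.
    z = None
    for ps, qs in zip(p_syn, q_syn):
        if ps == 0 and qs == 0:
            continue                   # this byte position is consistent
        if ps == 0 or qs == 0:
            return 'ambiguous'         # cannot be a single-strip error
        cand = (GF_LOG[qs] - GF_LOG[ps]) % 255
        if cand >= len(data) or (z is not None and cand != z):
            return 'ambiguous'
        z = cand
    return ('report data strip', z)    # report, but do not silently rewrite

Note that even when this does point at a single strip, it is still only the
most plausible single-error explanation - which is exactly Neil's caveat
quoted above.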
> At least indicate *where* the corruption is in the report. (I'd say
> "repair, as a non-default option" for people with a different
> availability/P(corruption) tradeoff -- since, after all, if you're using
> RAID in the first place you value high availability across disk problems
> more than most people do, and there is a difference between one bit of
> unreported damage that causes a near-certain restore from backup and
> either zero or two of them plus a report with an LBA attached so you
> know you need to do something...)

One thing to consider here is the sort of person using the raid array. When
Neil wrote his article, raid6 would only have been used by an expert, and he
did not want to change existing data and make life harder for a systems
administrator doing more serious repair. These days, however, the raid6
"administrator" may be someone who owns a NAS box and has no idea what raid,
or even Linux, actually is. In such cases, "smart" repair is probably the
best idea if the filesystem on top is not BTRFS.

>> as follows:
>>
>> 1. md reports all data drives and the LBAs for the affected stripe
>> (otherwise this is not simple if it has to figure out which drive is
>> actually affected but that's not required, just a matter of better
>> efficiency in finding out what's really affected.)
>
> Yep.
>
>> 2. the file system needs to be able to accept the error from md
>
> It would probably need to report this as an -EIO, but I don't know of
> any filesystems that can accept asynchronous reports of errors like
> this. You'd need reverse mapping to even stand a chance (a non-default
> option on xfs, and of course available on btrfs and zfs too). You'd
> need self-healing metadata to stand a chance of doing anything about it.
> And god knows what a filesystem is meant to do if part of the file data
> vanishes. Replace it with \0? ugh. I'd almost rather have the error
> go back out to a monitoring daemon and have it send you an email...
>
>> 3. the file system reports what it negatively impacted: file system
>> metadata or data and if data, the full filename path.
>>
>> And now suddenly this work is likewise non-trivial.
>
> Yeah, it's all the layers stacked up to the filesystem that are buggers
> to deal with... and now the optional 'just repair it dammit' approach
> seems useful again, if just because it doesn't have to deal with all
> these extra layers.
>
>> And there is already something that will do exactly this: ZFS and
>> Btrfs. Both can unambiguously, efficiently determine whether data is
>> corrupt even if a drive doesn't report a read error.
>
> Yeah. Unfortunately both have their own problems: ZFS reimplements the
> page cache and adds massive amounts of inefficiency in the process, and
> btrfs is... well... not really baked enough for the sort of high-
> availability system that's going to be running RAID, yet. (Alas!)

I disagree about BTRFS here. First, raid is a good idea no matter how
"experimental" you consider your filesystem. Second, BTRFS is solid enough
for a great many uses - I use it on laptops, desktops and servers. /No/
storage system should be viewed as infallible - backups are important. So if
BTRFS were to eat my data, I'd get it back from backups - just as I would if
the server died, both disks failed, it got stolen, or whatever. But BTRFS on
our servers means very cheap regular snapshots.
That protects us from the biggest cause of data loss - user error.

> (Recent xfs can do the same with metadata, but not data.)