On 2/9/20 23:36, Roy Sigurd Karlsbakk wrote:
----- Original Message -----
From: "David C. Rankin" <drankinatty@xxxxxxxxxxxxxxxxxx>
To: "Linux Raid" <linux-raid@xxxxxxxxxxxxxxx>
Sent: Saturday, 22 August, 2020 03:42:40
Subject: Re: Feature request: Remove the badblocks list
On 8/18/20 4:03 PM, Håkon Struijk Holmen wrote:
Hi,
Thanks for the CC, I just managed to get myself subscribed to the list :)
I have gathered some thoughts on the subject as well after reading up on it,
figuring out what the actual header format is, and writing a tool [3] to fix my
array...
<snip>
But I have some complaints about the thing..
Well,
There is code in all things that can be fixed, but I for one will chime in
and say I don't care if I lose a stripe or two, so long as when a disk fails I
can pop the new one in and it rebuilds without issue (which it does, even when
the disk was replaced due to bad blocks).
So whatever is done, don't fix what isn't broken and introduce more bugs
along the way. If this is such an immediate problem, then why aren't patches
being attached to the complaints?
The problem is that it's already broken. Take a single mirror. One drive experiences a bad sector; fine, you have redundancy, so you read the data from the other drive and md flags the sector on the first drive as bad. Then the second drive is replaced and you lose the data: the new drive gets flagged with the same sector as faulty, since the first drive has it flagged and cannot supply the data to rebuild from. So you replace the first drive, and during resync its replacement also gets flagged as having a bad sector. And so on.
Modern disks (that is, disks made in the last 20 years or so) reallocate sectors as they wear out. We have redundancy to handle errors, not to pinpoint them on disks and fill up not-so-smart lists with supposedly broken sectors that actually work fine. If md sees a drive with excessive errors, that drive should be kicked out and marked as dead, but it should not interfere with the rest of the raid.
Kind regards
roy
I'm no MD expert, but there are a couple of things to consider...
1) MD doesn't mark the sector as bad unless we try to write to it, AND
the drive replies to say it could not be written. So, in your case, the
drive is saying that it doesn't have any "spare" sectors left to
re-allocate; we are already past that point.
2) When MD tries to read and gets an error, it reads from the other
mirror, or reconstructs from parity/etc, and automatically attempts to
write the data back to the sector; see point 1 above for the failure case.
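
For illustration only, here is a minimal userspace sketch of that flow in C.
It is a toy model, not the kernel code; the drives, sectors and helper
functions are all made up for the example:

/* badblock_flow.c - toy userspace model of the flow described above.
 * NOT the kernel code; everything here is made up for illustration.
 * Build with: cc -o badblock_flow badblock_flow.c
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NDRIVES  2                  /* a simple two-way mirror */
#define NSECTORS 8

struct drive {
    char data[NSECTORS];            /* one byte per "sector" */
    bool media_bad[NSECTORS];       /* physical sector failing, no spares left */
    bool badblock[NSECTORS];        /* md's per-member bad block list */
};

static struct drive drives[NDRIVES];

static bool read_sector(int d, int s, char *out)
{
    if (drives[d].media_bad[s] || drives[d].badblock[s])
        return false;               /* read error (or listed bad block) */
    *out = drives[d].data[s];
    return true;
}

static bool write_sector(int d, int s, char val)
{
    if (drives[d].media_bad[s])
        return false;               /* drive reports the write failed */
    drives[d].data[s] = val;
    return true;
}

/* Point 2: on a read error, repair from the mirror and write back.
 * Point 1: only if that write-back fails is the sector marked bad. */
static void handle_read_error(int d, int s)
{
    int other = (d == 0) ? 1 : 0;
    char val;

    if (!read_sector(other, s, &val)) {
        printf("sector %d: no good copy left, data lost\n", s);
        return;
    }
    if (!write_sector(d, s, val)) {
        drives[d].badblock[s] = true;
        printf("drive %d, sector %d: write-back failed, marked bad\n", d, s);
    }
}

int main(void)
{
    memset(drives, 0, sizeof drives);
    drives[0].data[3] = drives[1].data[3] = 'x';
    drives[0].media_bad[3] = true;   /* drive 0 has run out of spare sectors */

    handle_read_error(0, 3);         /* drive 0 just failed to read sector 3 */
    printf("drive 0 badblock[3] = %d\n", drives[0].badblock[3]);
    return 0;
}

The only point of the model is the ordering: a read error alone never creates
a bad block entry; only a failed write-back of the reconstructed data does.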
So by the time MD gets a write error for a sector, the drive really is
bad, and MD can no longer ensure that *this* sector will be able to
properly store data again (whatever level of RAID we asked for, that
level can't be achieved with one drive faulty). So MD marks it bad, and
won't store any user data in that sector in future. As other drives are
replaced, we mark the corresponding sector on those drives as also bad,
so they also know that no user data should be stored there.
Eventually, we replace the faulty disk, and it would probably be safe to
store user data in the marked sector (assuming the new drive is not
faulty on the same sector, and all other member drives are not faulty on
the same sector).
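
Sticking with the toy model above (again, not the kernel code), the
inheritance step during a rebuild looks roughly like this: if no surviving
member can supply the sector, the replacement drive simply gets the same
entry added to its own list, healthy media or not.

/* Rebuild one sector of a freshly added replacement drive (toy model
 * continued).  If no surviving member has a readable copy, the new
 * drive inherits the bad block entry even though its media is fine. */
static void rebuild_sector(int new_drive, int s)
{
    char val;

    for (int d = 0; d < NDRIVES; d++) {
        if (d == new_drive)
            continue;
        if (read_sector(d, s, &val)) {
            write_sector(new_drive, s, val);    /* normal copy */
            return;
        }
    }
    drives[new_drive].badblock[s] = true;       /* carried over, data lost */
}

In the state left by the first example, replacing drive 1 (the only good
copy) and calling rebuild_sector(1, 3) finds no readable source, so the
fresh drive inherits the entry - which is exactly the cascade Roy described.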
So, to "fix" this, we just need a way to tell MD to try and write to all
member drives, on all faulty sectors, and if any drive returns fails to
write, then keep the sector as marked bad, if *ALL* drives succeed, then
remove from the bad blocks list on all members.
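
In the same toy model, that fix is essentially the following pass over each
recorded bad sector; whatever data gets written is necessarily reconstructed
or filler, since by definition no good copy exists, but only the write
results decide whether the entry may be dropped.

/* Proposed cleanup pass (toy model continued): attempt the write on
 * every member.  One failure and every entry stays marked; only if
 * all members succeed is the sector dropped from all the lists. */
static bool try_clear_badblock(int s, char fill)
{
    for (int d = 0; d < NDRIVES; d++)
        if (!write_sector(d, s, fill))
            return false;           /* this member really is bad here */

    for (int d = 0; d < NDRIVES; d++)
        drives[d].badblock[s] = false;
    return true;
}

Hooking something like this into the existing repair or replace paths, as
suggested below, would let the list shrink again once the genuinely bad
member has been swapped out.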
So why not add this feature to fix the problem, instead of throwing away
something that is potentially useful? Perhaps this could be done as part
of the "repair" mode, or done during a replace/add (when we reach the
"bad" sector, test the new drive, test all existing drives, and then
continue with the repair/add.
Would that solve the "bug"?
PS: As you noted, if MD gets repeated write errors for one drive, then
it will be kicked out. That threshold is configurable.