On 2/9/20 23:36, Roy Sigurd Karlsbakk wrote:
----- Original Message -----
From: "David C. Rankin" <drankinatty@xxxxxxxxxxxxxxxxxx>
To: "Linux Raid" <linux-raid@xxxxxxxxxxxxxxx>
Sent: Saturday, 22 August, 2020 03:42:40
Subject: Re: Feature request: Remove the badblocks list
On 8/18/20 4:03 PM, Håkon Struijk Holmen wrote:
Hi,
Thanks for the CC, I just managed to get myself subscribed to the list :)
I have gathered some thoughts on the subject as well after reading up on it,
figuring out what the actual header format is, and writing a tool [3] to fix my
array...
<snip>
But I have some complaints about the thing..
Well,
There is code in all things that can be fixed, but I for one will chime in
and say I don't care if I lose a stripe or two, so long as when a disk fails I
can pop the new one in and it rebuilds without issue (which it does, even when
the disk was replaced due to bad blocks).
So whatever is done, don't fix what isn't broken and introduce more bugs
along the way. If this is such an immediate problem, then why aren't patches
being attached to the complaints?
The problem is that it's already broken. Take a single mirror. One drive experiences a bad sector; fine, you have redundancy, so you read the data from the other drive and md flags the sector on the first drive as bad. Then the second drive is replaced and you lose the data: the new drive gets flagged with the same sector as faulty, since the first drive has it flagged and cannot supply the data to rebuild from. So you replace the first drive, and during resync its replacement also gets flagged as having a bad sector. And so on.
Modern disks (that is, disks made in the last 20 years or so) reallocate sectors as they wear out. We have redundancy to handle errors, not to pinpoint them on disks and fill up not-so-smart lists with supposedly broken sectors that actually work fine. If md sees a drive with excessive errors, that drive should be kicked out and marked as dead, but it should not interfere with the rest of the raid.
Kind regards
roy
I'm no MD expert, but there are a couple of things to consider...
1) MD doesn't mark the sector as bad unless we try to write to it, AND
the drive replies to say it could not be written. So, in your case, the
drive is saying that it doesn't have any "spare" sectors left to
re-allocate; we are already past that point.
2) When MD tries to read and gets an error, it reads from the other
mirror, or reconstructs from parity/etc, and automatically attempts to
write the data back to the sector; see point 1 above for the failure case.
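
For illustration only, here is a minimal userspace sketch of that flow in C.
It is a toy model, not the kernel code; the drives, sectors and helper
functions are all made up for the example:

/* badblock_flow.c - toy userspace model of the flow described above.
 * NOT the kernel code; everything here is made up for illustration.
 * Build with: cc -o badblock_flow badblock_flow.c
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NDRIVES  2                  /* a simple two-way mirror */
#define NSECTORS 8

struct drive {
    char data[NSECTORS];            /* one byte per "sector" */
    bool media_bad[NSECTORS];       /* physical sector failing, no spares left */
    bool badblock[NSECTORS];        /* md's per-member bad block list */
};

static struct drive drives[NDRIVES];

static bool read_sector(int d, int s, char *out)
{
    if (drives[d].media_bad[s] || drives[d].badblock[s])
        return false;               /* read error (or listed bad block) */
    *out = drives[d].data[s];
    return true;
}

static bool write_sector(int d, int s, char val)
{
    if (drives[d].media_bad[s])
        return false;               /* drive reports the write failed */
    drives[d].data[s] = val;
    return true;
}

/* Point 2: on a read error, repair from the mirror and write back.
 * Point 1: only if that write-back fails is the sector marked bad. */
static void handle_read_error(int d, int s)
{
    int other = (d == 0) ? 1 : 0;
    char val;

    if (!read_sector(other, s, &val)) {
        printf("sector %d: no good copy left, data lost\n", s);
        return;
    }
    if (!write_sector(d, s, val)) {
        drives[d].badblock[s] = true;
        printf("drive %d, sector %d: write-back failed, marked bad\n", d, s);
    }
}

int main(void)
{
    memset(drives, 0, sizeof drives);
    drives[0].data[3] = drives[1].data[3] = 'x';
    drives[0].media_bad[3] = true;   /* drive 0 has run out of spare sectors */

    handle_read_error(0, 3);         /* drive 0 just failed to read sector 3 */
    printf("drive 0 badblock[3] = %d\n", drives[0].badblock[3]);
    return 0;
}

The only point of the model is the ordering: a read error alone never creates
a bad block entry; only a failed write-back of the reconstructed data does.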
So by the time MD gets a write error for a sector, the drive really is
bad, and MD can no longer ensure that *this* sector will be able to
properly store data again (whatever level of RAID we asked for, that
level can't be achieved with one drive faulty). So MD marks it bad, and
won't store any user data in that sector in future. As other drives are
replaced, we mark the corresponding sector on those drives as also bad,
so they also know that no user data should be stored there.
Eventually, we replace the faulty disk, and it would probably be safe to
store user data in the marked sector (assuming the new drive is not
faulty on the same sector, and all other member drives are not faulty on
the same sector).
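
Sticking with the toy model above (again, not the kernel code), the
inheritance step during a rebuild looks roughly like this: if no surviving
member can supply the sector, the replacement drive simply gets the same
entry added to its own list, healthy media or not.

/* Rebuild one sector of a freshly added replacement drive (toy model
 * continued).  If no surviving member has a readable copy, the new
 * drive inherits the bad block entry even though its media is fine. */
static void rebuild_sector(int new_drive, int s)
{
    char val;

    for (int d = 0; d < NDRIVES; d++) {
        if (d == new_drive)
            continue;
        if (read_sector(d, s, &val)) {
            write_sector(new_drive, s, val);    /* normal copy */
            return;
        }
    }
    drives[new_drive].badblock[s] = true;       /* carried over, data lost */
}

In the state left by the first example, replacing drive 1 (the only good
copy) and calling rebuild_sector(1, 3) finds no readable source, so the
fresh drive inherits the entry - which is exactly the cascade Roy described.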
So, to "fix" this, we just need a way to tell MD to try and write to all
member drives, on all faulty sectors, and if any drive returns fails to
write, then keep the sector as marked bad, if *ALL* drives succeed, then
remove from the bad blocks list on all members.
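
In the same toy model, that fix is essentially the following pass over each
recorded bad sector; whatever data gets written is necessarily reconstructed
or filler, since by definition no good copy exists, but only the write
results decide whether the entry may be dropped.

/* Proposed cleanup pass (toy model continued): attempt the write on
 * every member.  One failure and every entry stays marked; only if
 * all members succeed is the sector dropped from all the lists. */
static bool try_clear_badblock(int s, char fill)
{
    for (int d = 0; d < NDRIVES; d++)
        if (!write_sector(d, s, fill))
            return false;           /* this member really is bad here */

    for (int d = 0; d < NDRIVES; d++)
        drives[d].badblock[s] = false;
    return true;
}

Hooking something like this into the existing repair or replace paths, as
suggested below, would let the list shrink again once the genuinely bad
member has been swapped out.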
So why not add this feature to fix the problem, instead of throwing away
something that is potentially useful? Perhaps this could be done as part
of the "repair" mode, or done during a replace/add (when we reach the
"bad" sector, test the new drive, test all existing drives, and then
continue with the repair/add.
Would that solve the "bug"?
PS: As you noted, if MD gets repeated write errors for one drive, then
it will be kicked out. That threshold is configurable.