Re: Feature request: Remove the badblocks list

On 3/9/20 00:50, Roy Sigurd Karlsbakk wrote:
I'm no MD expert, but there are a couple of things to consider...

1) MD doesn't mark the sector as bad unless we try to write to it, AND
the drive replies to say it could not be written. So, in your case, the
drive is saying that it doesn't have any "spare" sectors left to
re-allocate; we are already past that point.

2) When MD gets a read error, it reads from the other mirror, or
reconstructs from parity/etc., and automatically attempts to write the
data back to the sector; see point 1 above for the failure case.

So by the time MD gets a write error for a sector, the drive really is
bad, and MD can no longer ensure that *this* sector will be able to
properly store data again (whatever level of RAID we asked for, that
level can't be achieved with one drive faulty). So MD marks it bad, and
won't store any user data in that sector in the future. As other drives
are replaced, we mark the corresponding sector on those drives as bad
too, so they also know that no user data should be stored there.
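
To make that concrete, here is a toy C sketch of the read-error path
described above. This is NOT the real md/raid1 code; handle_read_error,
struct member and everything else here are invented purely for
illustration of the logic: a read error is serviced from the other
mirror, the data is written back, and only a failed write-back puts the
sector on the bad blocks list.

/* Toy model of the read-error path described above.
 * Not the real md code; all names are invented for illustration. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_BAD 16

struct member {
    const char *name;
    bool read_fails;   /* pretend this sector can't be read       */
    bool write_fails;  /* pretend this sector can't be rewritten  */
    long long bad[MAX_BAD];
    int nbad;
};

/* Point 2: on a read error, rebuild the data from the other mirror and
 * try to write it back to the failing member. */
static bool handle_read_error(struct member *m, struct member *mirror,
                              long long sector)
{
    if (mirror->read_fails) {
        printf("%s: sector %lld unreadable and no good copy left\n",
               m->name, sector);
        return false;
    }
    /* Point 1: only a failed write-back marks the sector bad. */
    if (m->write_fails) {
        if (m->nbad < MAX_BAD)
            m->bad[m->nbad++] = sector;
        printf("%s: write-back of sector %lld failed, added to BBL\n",
               m->name, sector);
        return false;
    }
    printf("%s: sector %lld rewritten from mirror, no BBL entry\n",
           m->name, sector);
    return true;
}

int main(void)
{
    struct member a = { "sda1", true,  true,  {0}, 0 };  /* worn-out drive */
    struct member b = { "sdb1", false, false, {0}, 0 };  /* healthy mirror */

    handle_read_error(&a, &b, 123456);
    printf("sda1 now has %d bad block(s) recorded\n", a.nbad);
    return 0;
}

The point the sketch tries to show is that the BBL entry is only created
by the write-back failure, never by the read failure on its own.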

Eventually, we replace the faulty disk, and it would probably be safe to
store user data in the marked sector (assuming the new drive, and all of
the other member drives, are not faulty at that same sector).

So, to "fix" this, we just need a way to tell MD to try and write to all
member drives, on all faulty sectors, and if any drive returns fails to
write, then keep the sector as marked bad, if *ALL* drives succeed, then
remove from the bad blocks list on all members.
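
As a sketch only (the structures and names below are invented, and no
such pass exists in md today), the logic would look something like this:
walk the recorded bad sectors, attempt a test write on every member, and
drop the entry only when every member succeeds.

/* Sketch of the proposed "revalidate bad blocks" pass.
 * Invented data structures; nothing here exists in md today. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct bbl_entry {
    long long sector;
    bool still_bad;
};

/* Pretend this issues a test write of the sector to one member and
 * returns true on success. */
typedef bool (*write_fn)(int member, long long sector);

/* Clear an entry only if the test write succeeds on ALL members. */
static void revalidate(struct bbl_entry *bbl, size_t n,
                       int nmembers, write_fn try_write)
{
    for (size_t i = 0; i < n; i++) {
        bool all_ok = true;
        for (int m = 0; m < nmembers; m++) {
            if (!try_write(m, bbl[i].sector)) {
                all_ok = false;   /* any failure => keep it marked bad */
                break;
            }
        }
        bbl[i].still_bad = !all_ok;
    }
}

static bool try_write_stub(int member, long long sector)
{
    (void)member; (void)sector;
    return true;   /* pretend the replacement drives all accept the write */
}

int main(void)
{
    struct bbl_entry bbl[] = { { 123456, true }, { 123464, true } };

    revalidate(bbl, 2, 3, try_write_stub);
    for (size_t i = 0; i < 2; i++)
        printf("sector %lld: %s\n", bbl[i].sector,
               bbl[i].still_bad ? "still bad" : "cleared");
    return 0;
}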

So why not add this feature to fix the problem, instead of throwing away
something that is potentially useful? Perhaps this could be done as part
of the "repair" mode, or during a replace/add (when we reach the "bad"
sector, test the new drive, test all the existing drives, and then
continue with the replace/add).
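
For reference, the existing hook such a pass might hang off is the
sync_action attribute in sysfs. The snippet below just triggers the
existing "repair" action on an assumed array named md0; today that does
not touch the BBL at all, it is only shown as the place where a future
extension could live.

/* Kick off md's existing "repair" pass on /dev/md0 by writing to sysfs.
 * Today this does NOT clear bad block list entries; the idea above would
 * extend this (or a hypothetical new action) to re-test recorded sectors. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/block/md0/md/sync_action";  /* assumed array */
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return 1;
    }
    fputs("repair\n", f);   /* same effect as: echo repair > .../sync_action */
    fclose(f);
    return 0;
}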

Would that solve the "bug"?
I'd rather md stopped fixing "somebody else's problem", that is, the disk, and just did its job. As for my case, I have tried to manually read the sectors named in the badblocks list, and they all work. All of them. But there's no way to fix this, since they are proclaimed dead. So are the same-numbered sectors on the sibling drives, regardless of their status.
Just because you can read them doesn't mean you can write them. Clearly, at some point in time, one of your drives failed. You now need to recover from that failed drive in the most sensible way.
If a drive has multiple issues with bad sectors, kick it out. It has no business being in the RAID anymore.

And if a group of 100 sectors are bad on drive 1, and 100 different sectors on drive 2, you want to kick both drives out, and destroy all your data until you can create a new array and restore from backup?

OR, just mark those parts of all disks as faulty; then, at some point in the future, you replace the disks and find a way to tell MD that the sectors are working again (preferably re-testing them before marking them as OK)?

BTW, I just found this:

https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy

It suggests that there is indeed a bug which should be hunted down and fixed, and that the BBL isn't actually populated via failed writes: it is populated by failed reads during a replace/add, where the read fails on the source drive AND on the parity/mirror drives.

Either way, perhaps what is needed (if you are interested) is a repeatable test scenario causing the problem, which could then be used to identify and fix the bug.

Regards,
Adam



