Re: Feature request: Remove the badblocks list

On 9/2/20 6:32 PM, Adam Goryachev wrote:

On 3/9/20 01:25, Roy Sigurd Karlsbakk wrote:
I'd much rather have md stop fixing "somebody else's problem", that is, the disk, and just do its own job. As for this case, I have tried to manually read the sectors named in the badblocks list, and they all work. All of them. But there is no un-flagging them, since they are proclaimed dead. So are the sibling sectors with the same number on the other drives, regardless of their actual status.
Just because you can read them, doesn't mean you can write them.
Clearly, at some point in time, one of your drives failed. You now need
to recover from that failed drive in the most sensible way.
If a drive has multiple issues with bad sectors, kick it out. It has no business being in the RAID anymore.
And if a group of 100 sectors are bad on drive 1, and 100 different
sectors on drive 2, you want to kick both drives out, and destroy all
your data until you can create a new array and restore from backup?

OR, just mark those parts of all disks faulty, and at some point in the
future, you replace the disks, and then find a way to tell MD that the
sectors are working now (and preferably, re-test them before marking
them as OK)?

BTW, I just found this:

https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy
I linked to that earlier in the thread

Which suggests that there is indeed a bug which should be hunted down and
fixed, and that the BBL actually isn't populated via failed writes; it is
populated by failed reads while doing a replace/add, where the read fails
both on the source drive AND on the parity/mirror drives.
It has been neither hunted down nor fixed. It's the same issue, and it has stayed the same all these years.
So what will you do now to change that? Obviously nobody else has had enough of a problem with it to be bothered to "hunt it down and fix it". Can you help hunt it down at least?
Either way, perhaps what is needed (if you are interested) is a
repeatable test scenario causing the problem, which could then be used
to identify and fix the bug.
I have tried several things and they all show the same behaviour. I just don't know how to tell md "this drive's sector X is bad, so flag it as such".

Again, this is not the way to work around a problem. All this does is hide the real problems and let them grow across generations, instead of just flagging a bad drive as bad, since that's the originating problem here.

Best regards

roy

Based on the linked page, you would need to do something like this:

1) Create a clean array with correctly working disks

2) Tell the underlying block device to pretend there is a read error on a specific sector of one disk

3) Ask MD to replace the "bad" block device with a "good" one

4) See what happens with the BBL

5) Perform various reads/writes against that specific stripe, and document the outcome/behaviour

6) Replace another drive, and document the results

Hint: there is a block device that could sit between your actual block device and MD, and it can "pretend" there are certain errors. The answers here seem to contain relevant information: https://stackoverflow.com/questions/1870696/simulate-a-faulty-block-device-with-read-errors
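
For the error injection in step 2, I would try the dm-dust target (or dm-flakey, which the linked answers mention). Something along these lines might work; it is untested, the names used (/dev/loop*, dust0, /dev/md99) are just examples, and the dm-dust message syntax is from memory of its documentation, so double-check it against your kernel:

  # build a small test array out of loop devices
  truncate -s 200M disk0.img disk1.img disk2.img disk3.img
  for i in 0 1 2 3; do losetup /dev/loop$i disk$i.img; done

  # wrap one member in dm-dust so read errors can be injected on it later
  SECTORS=$(blockdev --getsz /dev/loop0)
  dmsetup create dust0 --table "0 $SECTORS dust /dev/loop0 0 512"

  mdadm --create /dev/md99 --level=5 --raid-devices=3 \
        /dev/mapper/dust0 /dev/loop1 /dev/loop2

  # mark a block bad and start returning read errors for it
  # (block numbers should be in units of the 512-byte block size given above)
  dmsetup message dust0 0 addbadblock 2048
  dmsetup message dust0 0 enable

  # replace the member with the injected error, then inspect the BBLs
  mdadm /dev/md99 --add /dev/loop3
  mdadm /dev/md99 --replace /dev/mapper/dust0
  mdadm --examine-badblocks /dev/loop3
  mdadm --examine-badblocks /dev/loop1
  # (to match the wiki scenario, errors would also have to be injected on
  #  another member covering the same stripe)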

As I said, I suspect that if a reproducible error is found, then it should be easier to fix the bug.

OTOH, you could just remove the BBL from your arrays, and ensure you create new arrays without the BBL.
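
For reference, the list can be inspected, and removed at assembly time, roughly like this. This is from memory of the mdadm man page, so please verify against your version; the device names are just examples:

  # show the on-disk bad block list of a member, if it has one
  mdadm --examine-badblocks /dev/sdX1

  # stop the array and re-assemble it without a BBL
  mdadm --stop /dev/md0
  mdadm --assemble /dev/md0 --update=no-bbl /dev/sdX1 /dev/sdY1 /dev/sdZ1
  # --update=no-bbl refuses to act if the list is non-empty; newer mdadm
  # versions also have --update=force-no-bbl for that case, if I recall correctly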

Regards,
Adam

Hi,

I think you may have misunderstood slightly. Bad block entries can be recorded based on failed read requests, which is the case Roy and I are complaining about. Such a read error may be only temporary, and it can affect multiple drives if there is some sort of controller problem.

I have actually done an experiment, and I would like to explain it in terms of your numbered points.


1) An NFS server was set up with a share, and some backing files of roughly 100 MB each were created on it. The server was given a secondary IP address for the client to use, which could be added or removed to simulate a transient controller failure. The client mounted the share with a soft NFS mount, so that it would return I/O errors after a timeout. The files were attached to loopback block devices and a RAID array was created from them; I think it was RAID 5. The array was formatted with xfs and filled with data, and the caches were dropped. (A rough sketch of the commands follows after point 3 below.)

2) The secondary IP was removed to simulate the controller temporarily failing. I then tried reading from the RAID array, producing I/O errors on all the drives. The IP was added back to restore communication, and md took the opportunity to write one of the drives full of bad blocks. The remaining block devices were kicked out of the array, perhaps because the writes to their bad block lists failed.

3) My attempt wasn't entirely successful, since only one drive got bad blocks; I think that was just luck. In this case md still has enough data to repair the error during a drive replacement. Maybe if one of the "healthy" drives were removed, we would see md failing to reconstruct the data and writing bad blocks to the new device. I didn't carry that step out, but that is how I understand the algorithm to work.
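
Roughly, the setup and the failure injection looked like the following. I'm reconstructing it from memory, so treat it as a sketch; the addresses, paths, mount options and the number of devices are just examples:

  # on the NFS server
  ip addr add 192.0.2.10/24 dev eth0            # secondary IP we can pull later
  mkdir -p /srv/bbltest
  for i in 0 1 2 3; do truncate -s 100M /srv/bbltest/disk$i.img; done
  exportfs -o rw,no_root_squash '*:/srv/bbltest'

  # on the client
  mount -t nfs -o soft,timeo=50,retrans=2 192.0.2.10:/srv/bbltest /mnt/bbltest
  for i in 0 1 2 3; do losetup /dev/loop$i /mnt/bbltest/disk$i.img; done
  mdadm --create /dev/md99 --level=5 --raid-devices=4 /dev/loop[0-3]
  mkfs.xfs /dev/md99
  mount /dev/md99 /mnt/test
  cp -a /usr/share/doc /mnt/test/               # fill it with some data
  echo 3 > /proc/sys/vm/drop_caches

  # simulate the "controller failure" on the server, read, then restore
  ip addr del 192.0.2.10/24 dev eth0
  # ...read from /mnt/test on the client until I/O errors appear...
  ip addr add 192.0.2.10/24 dev eth0
  mdadm --examine-badblocks /dev/loop0          # then inspect each member's BBL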


The issue I have is that a temporary read failure can cause blocks to be marked with a flag that means "the data here is not the correct data". To make a distinction, and to be able to retry reading these kinds of bad blocks later, read failures would have to be recorded differently. But there's just one flag, and it's used whether reading failed, writing failed, or the correct data could not be reconstructed for a new drive and the block was therefore never initialized...

I've talked to Roy, and we will probably try removing the lists; I think it will work, at least partially. He has been replacing drives from time to time without knowing about the bad block lists, so his bad blocks are a mix of entries on drives where the data actually is present and entries on drives where the data was never written in the first place. If we remove the lists, we will probably get back a mixture of correct and uninitialized data. I did the same to my array, but since I never replaced any drives, I was certain that I still had all the data. My drives don't actually have any bad blocks at all; I iterated over the lists and read every listed sector (roughly as sketched below).
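
This is approximately how I checked it. It assumes the sysfs layout I remember (/sys/block/mdX/md/dev-*/bad_blocks containing "sector length" pairs) and that the listed sectors are 512-byte offsets on the member device, which I haven't verified, so take it as a sketch:

  # for every member of /dev/md0, try to read back each sector in its BBL
  for dev in /sys/block/md0/md/dev-*; do
      member=/dev/$(basename "$dev" | sed 's/^dev-//')
      while read -r sector length; do
          [ -z "$sector" ] && continue
          echo "reading $length sector(s) at $sector on $member"
          dd if="$member" of=/dev/null bs=512 skip="$sector" count="$length"
      done < "$dev/bad_blocks"
  done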

I would expect md to report the array as degraded, send angry emails and so on, but it seems you will only know the state of your BBLs if you go and check them yourself.


Regards and thanks for understanding,
Håkon


