On 8/18/20 8:00 PM, Roy Sigurd Karlsbakk wrote:
Hi all
Hi,

Thanks for the CC. I just managed to get myself subscribed to the list :)

I have gathered some thoughts on the subject as well, after reading up on it, figuring out what the actual header format is, and writing a tool [3] to fix my array...

About the tool: it tries to read all the supposedly bad blocks on a drive, and erases the whole list if none of them fail to read. As long as it's not run against a drive whose blocks got marked because data was unavailable during a rebuild, this should make it possible to read the data again. A possible improvement would be to shrink the list down to the blocks that are actually bad if some still fail, but right now the tool refuses to do anything to the drive if md was correct about any of them. You also need to flip a variable that I hid awkwardly between two functions before it will write to the drive at all.

But I have some complaints about the thing..

Good data marked as bad:

My viewpoint comes from what happened to my own array, a 5 drive raid6. 3 of the drives had identical lists of bad blocks, while 2 had empty lists. Since raid6 can only make up for two missing members per stripe, the marked sectors corresponded to lost data. I solved this by iterating over the bad block list, verifying that all the sectors were in fact readable, and then removing the list. Since I had not replaced any drives, I was certain enough that I would not run into uninitialized space. This gave me back the data that md had decided was gone.

I do not really think one can say that the md badblock list corresponds to bad blocks on the device. The list consists of sectors where md thinks the data is permanently unavailable. Entries get added in two ways:

- A read error occurs, for any reason.
- A new drive is being rebuilt, but the array can't reconstruct the data that was supposed to go there, because badblock entries on the other devices leave it without a source for the data it is supposed to write. It's assumed that reads of such sectors would fail.
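The "read every listed range, clear the list only if nothing fails" check can be sketched roughly as below. This is an illustrative sketch, not the code from [3]. The function name and the input format are made up for the example, and it assumes the ranges are already absolute 512-byte sector offsets on the member device.

```python
import os

SECTOR_SIZE = 512  # assumption: the list counts in 512-byte sectors


def unreadable_ranges(device_path, bad_ranges):
    """Return the (start, count) ranges from bad_ranges that still fail to read.

    bad_ranges is an iterable of (start_sector, sector_count) tuples
    (hypothetical input format chosen for this sketch).
    """
    failed = []
    fd = os.open(device_path, os.O_RDONLY)
    try:
        for start, count in bad_ranges:
            wanted = count * SECTOR_SIZE
            try:
                # Note: this read may be served from the page cache. A real
                # tool would want O_DIRECT with aligned buffers to be sure
                # the request actually reaches the disk.
                data = os.pread(fd, wanted, start * SECTOR_SIZE)
            except OSError:
                failed.append((start, count))
                continue
            if len(data) != wanted:
                # Short read (for example past the end of the device):
                # treat it as a failure too.
                failed.append((start, count))
    finally:
        os.close(fd)
    return failed
```

An empty result means every listed range was readable. Only in that case would it be reasonable to consider clearing the list, and only if none of the entries can be uninitialized space from a rebuild.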
Since both kinds are added to the same list of bad blocks, it follows that even if you get a successful read from a listed block, it may just be uninitialized space.

Once enough drives have bad blocks for the same stripe, that data is gone. md will not read it, even if it's there on the drives.

I can only speculate on what happened in my case. So far I think that some intermittent controller failure caused all reads to return errors, and somehow md was still able to write to the badblock list. I don't think it's just me; it seems to be a common phenomenon that arrays end up with identical lists across drives. Whether this is down to controller failures or just a bug, it's not good, and it undermines the assumption that the underlying blocks are actually bad.

9 years ago, Lutz Vieweg asked "I've experienced drives with intermittent read / write failures (due to controller or power stability problems), and I wonder whether such a situation could quickly fill up the "bad block list", doing more harm than good in the "intermittent error"- szenario." [1]. I have my doubts that this was resolved. I also don't know if this is the cause of the issue with many drives sharing the exact same list, or if some other logic error is causing it.

md indicating all is good:

In the same thread [1], Neil Brown said "(...) You shouldn't aim to run an array with bad blocks any more than you should run an array degraded. The purpose of bad block management is to provide a more graceful failure path, not to encourage you to run an array with bad drives".

However, an array with bad blocks is not reported as "degraded", and you have to run --examine to even see them. The array is not treated as bad, so md communicates that it is still good. The end result is that the software encourages running an array with bad drives. Presumably the assumption was that one would treat this as a degraded array.
But you have to run --examine and specifically look for it to see that there are bad blocks.

Lack of documentation:

I have added some links in addition to the ones found by Roy. This was the extent of the documentation I was able to find; I'd be interested to hear if this is documented better elsewhere. The kernel documentation also briefly mentions the existence of bad block lists in mdraid. The wiki article on the superblock format [5] hasn't been updated with the badblock fields. The 2010 blog post [4] was the closest thing to documentation, even though it was written before the thing was finalized.

Overall:

I don't think this uncertainty is good at all. I feel like it would be easier to deal with a controller failure throwing the whole raid apart. You'd assemble it back together and check the filesystem, and with fingers crossed, everything would be fine.

I think the way md acts without this mechanism enabled makes sense. Drives thrown out of arrays still have data on them. This means that if unrecoverable errors occur, one can still run ddrescue and try to copy the array to new drives, one by one, and get as much data back as possible.

Once you hit a bad block during a read, adopting the zfs model of reconstructing the data from parity and overwriting the block seems better, because it tries to solve the problem right away so it doesn't happen the next time around. I think md will throw the block in the list and expect it to be fixed during the next check operation. Unless that doesn't happen, and more bad blocks accumulate until data loss happens..

I would also like to see the functionality changed to opt-in, or just removed. If it's kept as opt-in, I still hope that some of this feedback is taken on board. For example:

- Reporting the array as degraded if the lists get populated.
- Automatically fixing bad blocks as soon as possible, before the situation develops any further.
- Making uninitialized data and bad blocks two separate things, so that one can still try reading those blocks and keep track of where the data is supposed to be, and where it's definitely not.
- Maybe dropping badblocks and taking inspiration from ZFS instead.
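Until md reports something like this itself, a monitoring script could at least flag populated lists. The sketch below assumes the per-member sysfs layout described in the kernel's md documentation: a `bad_blocks` file under `/sys/block/mdX/md/dev-*/`, holding one "start length" pair per line, in sectors. The function names are made up for the example.

```python
from pathlib import Path


def parse_bad_blocks(text):
    """Parse 'start length' lines into (start, length) tuples."""
    ranges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            ranges.append((int(fields[0]), int(fields[1])))
    return ranges


def members_with_bad_blocks(md_dir):
    """Map member name -> bad block ranges, for members with a non-empty list.

    md_dir is assumed to look like /sys/block/md0/md, i.e. a directory
    containing one dev-<name> subdirectory per member device.
    """
    result = {}
    for dev_dir in sorted(Path(md_dir).glob("dev-*")):
        bb_file = dev_dir / "bad_blocks"
        if not bb_file.is_file():
            continue
        ranges = parse_bad_blocks(bb_file.read_text())
        if ranges:
            # Strip the "dev-" prefix to get the member name.
            result[dev_dir.name[len("dev-"):]] = ranges
    return result
```

Calling `members_with_bad_blocks("/sys/block/md0/md")` from cron and mailing any non-empty result would be one way to treat a populated list like a degraded array.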
[1] https://linux-raid.vger.kernel.narkive.com/R1rvkUiQ/using-the-new-bad-block-log-in-md-for-linux-3-1
[2] https://raid.wiki.kernel.org/index.php/The_Badblocks_controversy
[3] https://git.thehawken.org/hawken/md-badblocktool.git
[4] https://neil.brown.name/blog/20100519043730
[5] https://raid.wiki.kernel.org/index.php/RAID_superblock_formats

Regards,
Håkon