Just for discussion...
Proposal: md devices to have a badstripe table and space for re-allocation
Benefits:
Allows multiple block level failures on any combination of component md devices provided parity is not compromised.
Zero impact on performance in non-degraded mode.
No need for scanning (although it may be used as a trigger).
Works for all md personalities.
Overview:
Provide an 'on or off-array' store for any stripes impacted by block level failure.
Unlike a disk's badblock allocation this would be a temporary store since we'd insist on the underlying devices recovering fully from the problem before restoring full health.
This allows us to cope transiently and, in the event of non-recoverable errors, until the disk is replaced.
Downsides:
Resync'ing with multiple failing drives is more complex (but more resilient).
Some kind of store handler is needed.
Description:
I've structured this to look at the md driver, the userspace daemon, the store, failing drives and replacing and resync'ing drives.
md:
For normal md access the badstripe list has no entries and is ignored. A check of the badstripe table size (is the entry count zero?) is required prior to each stripe access.
If a write error occurs, rewrite the stripe to the store, then record the originating (faulty) stripe, along with the offending device/block, in the badstripe table and mark it bad. The device is marked 'failing'.
If a read error occurs, attempt to reconstruct the stripe from the other devices then follow the write error path.
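To make the "reconstruct the stripe from the other devices" step concrete, here is a minimal sketch assuming a RAID5-style layout with a single XOR parity chunk per stripe. That layout is an assumption for illustration only: RAID1 would simply read another copy, and RAID6 needs more than plain XOR. The function names are made up and are not md functions.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def reconstruct(chunks, failed_index):
    """Rebuild the chunk at failed_index from the surviving chunks.

    chunks holds every data chunk plus the parity chunk of one stripe.
    The entry at failed_index is treated as unreadable and is ignored.
    """
    survivors = [c for i, c in enumerate(chunks) if i != failed_index]
    return xor_blocks(survivors)

# Three data chunks plus their parity form one stripe (toy sizes).
data = [b"\x01\x02", b"\x10\x20", b"\xaa\x55"]
parity = xor_blocks(data)
stripe = data + [parity]

# Pretend the device holding chunk 1 returned a read error.
rebuilt = reconstruct(stripe, 1)
```

Once the chunk is rebuilt, the whole stripe can be handed to the write error path described above.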
For normal md access against stripes appearing in the badstripe list:
* Lock the badstripe table against the daemon (and other md threads)
* Check the stripe is still in the bad stripe list
* If not then the userland daemon fixed it. Release lock. Carry on as normal.
* If so then read/write from the reserved area.
* Release badstripe lock.
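The access path above could be modelled roughly like this. It is a userspace toy, not kernel code: the class, its fields and the dict-based "devices" are all hypothetical, and a real implementation would need proper memory ordering around the unlocked size check.

```python
import threading

class BadStripeArray:
    """Toy model of an md array with a badstripe table (all names hypothetical)."""

    def __init__(self):
        self.lock = threading.Lock()   # badstripe lock, shared with the daemon
        self.badstripes = {}           # stripe number -> slot in the reserved area
        self.main = {}                 # stripe number -> data at its normal location
        self.store = {}                # slot -> data held in the reserved area
        self.next_slot = 0

    def mark_bad(self, stripe, data):
        """Error path: park the stripe in the store and record it as bad."""
        with self.lock:
            slot = self.badstripes.get(stripe)
            if slot is None:
                slot = self.next_slot
                self.next_slot += 1
                self.badstripes[stripe] = slot
            self.store[slot] = data

    def read(self, stripe):
        if not self.badstripes:        # healthy array: table is empty, skip the lock
            return self.main[stripe]
        with self.lock:
            slot = self.badstripes.get(stripe)
            if slot is not None:       # still bad: serve it from the reserved area
                return self.store[slot]
        return self.main[stripe]       # not listed (or already fixed): carry on as normal

    def write(self, stripe, data):
        if self.badstripes:
            with self.lock:
                slot = self.badstripes.get(stripe)
                if slot is not None:   # still bad: write goes to the reserved area
                    self.store[slot] = data
                    return
        self.main[stripe] = data

arr = BadStripeArray()
arr.write(7, b"old")        # normal write, table is empty
arr.mark_bad(7, b"old")     # a later error moves stripe 7 into the store
arr.write(7, b"new")        # redirected to the reserved area
arr.write(8, b"other")      # unaffected stripe still goes to its normal location
```

The point of the empty-table test at the top of `read` and `write` is that a healthy array never takes the lock at all.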
Daemon:
A userland daemon could examine the reserved area, attempt a repair on a faulty stripe and, if it succeeds, could restore the stripe and mark the badstripe entry as clean thus freeing up the reserved area and restoring perfect health.
The daemon would:
* lock the badstripe table against md
* write the stripe back to the previously faulty area which shouldn't need locking against md since it's "not in use"
* correct the badstripe table
* release the lock
If the daemon fails then the badstripe entry is marked as unrecoverable.
If the daemon has failed to correct the error (unrecoverable in the badstripe table) then the drive should be kept as failing (not faulty) and should be replaced. The intention is to allow a failing drive to continue to be used in the event of a subsequent bad drive event.
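A rough sketch of the daemon's repair step, under the same caveats: the table layout, the `EntryState` names and the `write_back` callback are assumptions made up for the example, and the callback stands in for whatever actually rewrites the original location and reports whether it worked.

```python
import threading
from enum import Enum

class EntryState(Enum):
    BAD = "bad"
    UNRECOVERABLE = "unrecoverable"

def repair_stripe(table, lock, store, stripe, write_back):
    """Try to move one stripe from the reserved area back to its home location.

    table:      stripe number -> {"slot": int, "state": EntryState}
    store:      slot -> data held in the reserved area
    write_back: hypothetical callable(stripe, data) -> bool, True on success
    Returns True if the stripe is healthy afterwards.
    """
    with lock:                                   # lock the table against md
        entry = table.get(stripe)
        if entry is None:
            return True                          # nothing to repair
        data = store[entry["slot"]]
        if write_back(stripe, data):             # old location is "not in use" by md
            del table[stripe]                    # correct the badstripe table
            del store[entry["slot"]]             # free the reserved slot
            return True
        entry["state"] = EntryState.UNRECOVERABLE
        return False                             # drive stays 'failing'

main = {}

def good_write(stripe, data):
    main[stripe] = data
    return True

def bad_write(stripe, data):
    return False

lock = threading.Lock()
table = {
    7: {"slot": 0, "state": EntryState.BAD},
    9: {"slot": 1, "state": EntryState.BAD},
}
store = {0: b"seven", 1: b"nine"}

ok7 = repair_stripe(table, lock, store, 7, good_write)   # repair succeeds
ok9 = repair_stripe(table, lock, store, 9, bad_write)    # repair fails
```

A failed repair leaves the data in the store, so md keeps serving the stripe from the reserved area until the drive is replaced.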
The Store:
This could be reserved stripes at the start (?) of the component devices read/written using the current personality. Alternatively it could be a filesystem level store (possibly remote, on a resilient device or just in /tmp).
Failing drives:
From a reading point of view it seems possible to treat a failing drive as a faulty drive - until the event of another read failure on another drive. In that case the read error case above could still access the failing drive to attempt a recovery. This may help in the event of recovery from a failing drive where you want to minimise load against it. It may not be worthwhile.
Writing would still have to continue to maintain sync.
Drive replacement + resync:
If multiple devices go 'failing' then how are they removed (since they are all in use)? A spare needs to be added and then the resync code needs to ensure that one of the failing disks is synced to the spare. Then the failing disk is made faulty and removed.
This could be done by having a progression: failing -> failing-pending-remove -> faulty
As I said above a failing drive is not used for reads, only for writes.
Presumably a drive that is sync'ing is used for writes but not reads.
So if we add a good drive and mark it syncing and simultaneously mark the drive it replaces failing-pending-remove then the f-p-r drive won't be written to but is available for essential reads until the new drive is ready.
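The states and who-gets-what-I/O rules above could be written down as a small table. This is only a sketch of my reading of the text: the `ACTIVE` state, the "essential read" column and the device names are assumptions added for the example.

```python
from enum import Enum

class DriveState(Enum):
    ACTIVE = "active"
    SYNCING = "syncing"
    FAILING = "failing"
    FAILING_PENDING_REMOVE = "failing-pending-remove"
    FAULTY = "faulty"

# state -> (normal reads, essential reads during recovery, writes)
POLICY = {
    DriveState.ACTIVE: (True, True, True),
    DriveState.SYNCING: (False, False, True),
    DriveState.FAILING: (False, True, True),
    DriveState.FAILING_PENDING_REMOVE: (False, True, False),
    DriveState.FAULTY: (False, False, False),
}

# The only allowed progression for a drive on its way out.
NEXT = {
    DriveState.FAILING: DriveState.FAILING_PENDING_REMOVE,
    DriveState.FAILING_PENDING_REMOVE: DriveState.FAULTY,
}

def advance(state):
    """Move a drive one step along failing -> failing-pending-remove -> faulty."""
    if state not in NEXT:
        raise ValueError(f"no further transition from {state.value}")
    return NEXT[state]

def start_replacement(states, failing_dev, spare_dev):
    """Add the spare as syncing and, in the same step, retire the failing drive."""
    if states[failing_dev] is not DriveState.FAILING:
        raise ValueError("only a failing drive can be replaced this way")
    states[spare_dev] = DriveState.SYNCING
    states[failing_dev] = advance(states[failing_dev])

def finish_replacement(states, failing_dev, spare_dev):
    """Resync done: the spare becomes a full member, the old drive goes faulty."""
    states[spare_dev] = DriveState.ACTIVE
    states[failing_dev] = advance(states[failing_dev])

# Hypothetical device names.
states = {"sda": DriveState.ACTIVE, "sdb": DriveState.FAILING}
start_replacement(states, "sdb", "sdc")
during_resync = dict(states)
finish_replacement(states, "sdb", "sdc")
```

During the resync the old drive takes no writes but can still be read in an emergency, which is the behaviour described for f-p-r above.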
Some thoughts:
How much overhead is involved in checking each stripe read/write address against a *small* bad-stripe table? Probably none because most of the time, for a healthy md, the number of entries is 0.
Does the temporary space even have to be in the md space? Would it be easier to make it a file (not in the filesystem on the md device!!)? This avoids any messing with stripe offsets etc.
I don't claim to understand md's locking - the stuff above is a simplistic start on the additional locking related to moving stuff in and out of the badstripes area. I don't know where contention is handled - md driver or fs.
This is essentially only useful for single (or at least 'few') badblock errors - is that a problem worth solving? (From the thread title I assume so.)
How intrusive is this? I can't really judge. It mainly feels like error handling - and maybe handing off to a reused/simplified loopback-like device could handle 'hits' against the reserved area.
I'm only starting to read the code/device drivers books etc etc so if I'm talking rubbish then I'll apologise for your time and keep quiet :)
David
Guy wrote:
Neil said: "I hadn't thought about that yet. I suspect there would be little point in doing a scan when there was no redundancy. However a scan on a degraded raid6 that could still safely lose one drive would probably make sense."
I agree.
Also a RAID1 with 2 or more working devices. Don't forget, some people have 3 or more devices on the RAID1 arrays. From what I have read anyway.
Thanks, Guy
-----Original Message-----
From: Neil Brown [mailto:neilb@xxxxxxxxxxxxxxx]
Sent: Tuesday, November 16, 2004 6:04 PM
To: Guy
Cc: linux-raid@xxxxxxxxxxxxxxx
Subject: RE: Bad blocks are killing us!
On Tuesday November 16, bugzilla@xxxxxxxxxxxxxxxx wrote:
This sounds great!
But...
2/ Do you intend to create a user space program to attempt to correct the
bad block and put the device back in the array automatically? I
hope so.
Definitely. It would be added to the functionality of "mdadm --monitor".
If not, please consider correcting the bad block without kicking the
device out. Reason: Once the device is kicked out, a second bad block
on another device is fatal to the array. And this has been happening
a lot lately.
This is one of several things that makes it "a bit less trivial" than simply using the bitmap stuff. I will keep your comment in mind when I start looking at this in more detail. Thanks.
3/ Maybe don't do the bad block scan if the array is degraded. Reason: If
a bad block is found, that would kick out a second disk, which is fatal.
Since the stated purpose of this is to "check parity/copies are correct"
then you probably can't do this anyway. I just want to be sure. Also, if
during the scan a device is kicked, the scan should pause or abort. The
scan can resume once the array has been corrected. I would be happy if the
scan had to be restarted from the start. So a pause or abort is fine with
me.
I hadn't thought about that yet. I suspect there would be little point in doing a scan when there was no redundancy. However a scan on a degraded raid6 that could still safely lose one drive would probably make sense.
NeilBrown
- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html